Abstract

We attempt to recover an $n$-dimensional vector observed in white noise, where $n$
is large and the vector is known to be sparse, but the degree of sparsity is unknown.
We consider three different ways of defining sparsity of a vector: using the fraction of
nonzero terms; imposing power-law decay bounds on the ordered entries; and controlling
the $\ell_p$ norm for $p$ small. We obtain a procedure which is asymptotically minimax for
$\ell_r$ loss, simultaneously throughout a range of such sparsity classes.

The optimal procedure is a data-adaptive thresholding scheme, driven by control of 
the False Discovery Rate (FDR). FDR control is a relatively recent innovation in simul- 
taneous testing, ensuring that at most a certain fraction of the rejected null hypotheses 
will correspond to false rejections. 

In our treatment, the FDR control parameter $q_n$ also plays a determining role in
asymptotic minimaxity. If $q = \lim q_n \in [0, 1/2]$ and also $q_n > \gamma/\log(n)$, we get sharp
asymptotic minimaxity, simultaneously, over a wide range of sparse parameter spaces
and loss functions. On the other hand, $q = \lim q_n \in (1/2, 1]$ forces the risk to exceed
the minimax risk by a factor growing with $q$.

To our knowledge, this relation between ideas in simultaneous inference and asymp- 
totic decision theory is new. 

Our work provides a new perspective on a class of model selection rules which 
has been introduced recently by several authors. These new rules impose complexity 
penalization of the form 2 · log(potential model size / actual model size). We exhibit
a close connection with FDR-controlling procedures under stringent control of the false 
discovery rate. 
1 Introduction



The problem of model selection has attracted the attention of both applied and theoretical
statistics for as long as anyone can remember. In the setting of the standard linear model, 
we have noisy data on a response variable which we wish to predict linearly using a subset 
of a large collection of predictor variables. We believe that good parsimonious models can 
be constructed using only a relatively few variables from the available ones. In the spirit 
of the modern, computer-driven era, we would like a simple automatic procedure which is 
data adaptive, can find a good parsimonious model when one exists, and is effective for 
very different types of data and model. 

There has been an enormous range of contributions to this problem, so large in fact
that it would be impractical to summarize here. Some key contributions, mentioned further
below, include the AIC, BIC, and RIC model selection proposals (Akaike, 1973; Mallows,
1973; Schwarz, 1978; Foster and George, 1994). Key insights from this vast literature are:



• The tendency of certain rules (notably AIC), when used in an exhaustive model search
mode, to include too many irrelevant predictors - Breiman and Freedman (1983);



• The tendency of rules which do not suffer from this problem (notably RIC) to place 
evidentiary standards for inclusion in the model that are far stricter than the time- 
honored 'individually significant' single coefficient approaches. 

In this paper we consider a very special case of the model selection problem in which 
a full decision-theoretic analysis of predictive risk can be carried out. In this setting, 
model parsimony can be concretely defined and utilized, and we exhibit a model selection 
method enjoying optimality over a wide range of parsimony classes. While the full story
is rather technical, at its heart is a simple practical method with an easily
understandable benefit: the ability to prevent the inclusion of too many irrelevant predictors
- thus improving on AIC - while setting lower standards for inclusion - thus improving on
RIC. The optimality result assures us that in a certain sense the method is unimprovable.

Our special case is the problem of estimating a high-dimensional mean vector which is 
sparse, when the nature and degree of sparsity are unknown and may vary through a range 
of possibilities. We consider three ways of defining sparsity and will derive asymptotically 
minimax procedures applicable across all modes of definition. 

Our asymptotically minimax procedures will be based on a relatively recent innovation
- False Discovery Rate (FDR) control in multiple hypothesis testing. The FDR control
parameter plays a key role in delineating superficially similar cases where one can achieve 
asymptotic minimaxity and where one cannot. 

To our knowledge, this connection between developments in these two important sub- 
fields of statistics is new. Historically, the multiple hypothesis testing literature has had 
little to do with notions like minimax estimation or asymptotic minimaxity in estimation. 

The procedures we propose will be very easy to implement and run quickly on computers. 
This is in sharp contrast to certain optimality results in minimaxity which exhibit optimal 
procedures that are computationally unrealistic. Finally, because of recent developments 
in harmonic analysis - wavelets, wavelet packets, etc. - these results are of immediate 
practical significance in applied settings. Indeed, wavelet analysis of noisy signals can 
result in exactly the kind of sparse means problem discussed here. 

Our goal in this introduction is to make clear to the non decision-theorist the motivation 
for these results, the form of a few select results, and some of the implications. Later sections 
will give full details of the proofs and the methodology being studied here. 
1.1 Thresholding

Consider the standard multivariate normal mean problem:

$$y_i = \mu_i + \sigma_n z_i, \qquad z_i \overset{\text{i.i.d.}}{\sim} N(0,1), \qquad i = 1, \dots, n. \tag{1.1}$$



Here $\sigma_n$ is known, and the goal is to estimate the unknown vector $\mu$ lying in a fixed set
$\Theta_n$. The index $n$ counts the number of variables and is assumed large. The key extra
assumption, to be quantified later, is that the vector $\mu$ is sparse: only a small number of
components are significantly large, and the indices, or locations, of these large components
are not known in advance. In such situations, thresholding will be appropriate; specifically,
hard thresholding at threshold $t\sigma_n$, meaning the estimate $\hat\mu$ whose $i$th component is

$$\hat\mu_i = \begin{cases} y_i, & |y_i| \ge t\sigma_n, \\ 0, & \text{else.} \end{cases} \tag{1.2}$$



A compelling motivation for this strategy is provided by wavelet analysis, since the 
wavelet representation of many smooth and piecewise smooth signals is sparse in precisely
our sense (Donoho et al., 1995). Consider, for example, the empirical wavelet coefficients
in Figure 1(c). Model (1.1) is quite plausible if we consider the coefficients to be grouped
level by level. Within a level, the number of large coefficients is small, though the relative 
number clearly decreases as one moves from coarse to fine levels of resolution. 




Figure 1: (a): sample NMR spectrum provided by A. Maudsley and C. Raphael, $n = 1024$, and
discussed in Donoho et al. (1995). (c): Empirical wavelet coefficients $w_{jk}$ displayed by nominal
location and scale $j$, computed using a discrete orthogonal wavelet transform and the Daubechies
near symmetric filter of order $N = 6$. (d): Wavelet coefficients after hard thresholding using the
FDR threshold described at (1.8), with estimated scale $\hat\sigma = \mathrm{med.abs.dev.}(w_{9k})/0.6745$, a resistant
estimate of scale at level 9 - for details on $\hat\sigma$, see Donoho et al. (1995). (b): Reconstruction using
inverse discrete wavelet transform.



1.2 Sparsity 

In certain subfields of signal and image processing, the wavelet coefficients of a typical object 
can be modeled as a sparse vector; the interested reader might consult literature going back 
to Field (1987), extending through DeVore, Jawerth, and Lucier (1992), Ruderman (1994),
Simoncelli (1999) and Huang and Mumford (1999). A representative result was given by
Simoncelli, who found that in looking at a database of images, the typical behavior of
histograms of wavelet coefficients at a single resolution level of the wavelet pyramid was 
highly structured, with a sharp peak at the origin and somewhat heavy tails. In short, 
many coefficients are small in amplitude while a few are very large. 

Wavelet analysis of images is not the only place where one meets transforms with sparse 
coefficients. There are several other signal processing settings - for example acoustic signal
processing - where, when viewed in an appropriate basis, the underlying object has sparse
coefficients (Benedetto, 1993).

In this paper we consider several ways to define sparsity precisely. 

The most intuitive notion of sparsity is simply that there is a relatively small proportion
of nonzero coefficients. Define the $\ell_0$ quasi-norm by $\|x\|_0 = \#\{i : x_i \ne 0\}$. Fixing a
proportion $\eta$, the collection of sequences with at most a proportion $\eta$ of nonzero entries is

$$\ell_0[\eta] = \{\mu \in \mathbb{R}^n : \|\mu\|_0 \le \eta n\}. \tag{1.3}$$



By analogy with night-sky images, we will call nearly-black a setting where the fraction of
non-zero entries $\eta \approx 0$ (Donoho et al., 1992).



Sparsity can also mean that there is a relatively small proportion of relatively large
entries. Define the decreasing rearrangement of the amplitudes of the entries so that

$$|\mu|_{(1)} \ge |\mu|_{(2)} \ge \dots \ge |\mu|_{(n)};$$

we control the entries by a termwise power-law bound on the decreasing rearrangements:

$$|\mu|_{(k)} \le C \cdot k^{-\beta}, \qquad k = 1, 2, \dots.$$

For reasons which will not be immediately obvious, we work with $p = 1/\beta$ instead, and call
such a constraint a weak-$\ell_p$ constraint. The interesting range is $p$ small, yielding substantial
sparsity. One can check whether a vector obeys such a constraint by plotting the decreasing
rearrangement on log-log axes, and comparing the plot with a straight line of slope $-1/p$.
Certain values of $p < 2$ provide a reasonable model for wavelet coefficients of real-world
images (DeVore, Jawerth, and Lucier, 1992).



Formally, a weak $\ell_p$ ball of radius $\eta$ is defined by requiring that the ordered magnitudes
of components of $\mu$ decay quickly:

$$m_p[\eta] = \{\mu \in \mathbb{R}^n : |\mu|_{(k)} \le \eta n^{1/p} k^{-1/p} \text{ for all } k = 1, \dots, n\}. \tag{1.4}$$

Weak $\ell_p$ has a natural 'least-sparse' sequence, namely

$$\bar\mu_k = \eta n^{1/p} k^{-1/p}, \qquad k = 1, \dots, n \tag{1.5}$$

(and its permutations). We also measure sparsity using $\ell_p$ norms with $p$ small:

$$\|\mu\|_p = \Big(\sum_{i=1}^n |\mu_i|^p\Big)^{1/p}.$$

That small $p$ emphasises sparsity may be seen by noting that the two vectors

$$(1, 0, \dots, 0) \quad \text{and} \quad (n^{-1/p}, \dots, n^{-1/p}) \tag{1.6}$$

have equal $\ell_p$ norms, but when $p$ is small, the components of the latter, dense, vector
are all negligible. Strong-$\ell_p$ balls of small average radius $\eta$ are defined so:

$$\ell_p[\eta] = \Big\{\mu : \frac{1}{n}\sum_{i=1}^n |\mu_i|^p \le \eta^p\Big\}.$$

If we refer to $\ell_p$ without qualification - weak or strong - we mean strong $\ell_p$.

There are important relationships between these classes. Note that as $p \to 0$, the $\ell_p$
norms approach $\ell_0$: $\|\mu\|_p^p \to \|\mu\|_0$. Weak-$\ell_p$ balls contain the corresponding strong $\ell_p$ balls,
but only just:

$$\ell_p[\eta] \subset m_p[\eta] \not\subset \ell_{p'}[\eta], \qquad p' > p.$$



1.3 Adapting to Unknown Sparsity 

Estimation of sparse normal means over $\ell_p$ balls has been carefully studied in Donoho and Johnstone
(1994b), with the result that much is known about asymptotically minimax strategies for
estimation. In essence, if we know the degree of sparsity of the sequence, then it turns 
out that thresholding is indeed asymptotically minimax, and there are simple formulas for 
optimal thresholds. 

Figure 2 gives an example. One simple model of varying sparsity levels sets $n_0 = n^{\beta}$
non-zero components out of $n$, $0 < \beta < 1$. Theory, reviewed in Section 3, suggests that a
threshold of about $t_\beta = \sigma_n \sqrt{2(1 - \beta)\log n}$ is appropriate for such a sparsity level. Suppose
that $\beta$ is unknown, and examine the consequences of using misspecified thresholds $t_{\beta'}$, $\beta' \ne \beta$.
The solid lines in Figure 2 show the increased absolute error incurred using $t_{1/2}$ when $t_{1/4}$ is
appropriate - the total absolute error is five times worse. For squared error, the misspecified
threshold produces a discrepancy that is larger by nearly a factor of six.
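
The two thresholds compared in this example can be reproduced directly from the formula $t_\beta = \sigma_n\sqrt{2(1-\beta)\log n}$. The following is a minimal sketch; the function name `sparsity_threshold` is ours, chosen for illustration only.

```python
from math import log, sqrt

def sparsity_threshold(n, beta, sigma=1.0):
    """Threshold sigma * sqrt(2 * (1 - beta) * log n), suited to n**beta non-zero means."""
    return sigma * sqrt(2.0 * (1.0 - beta) * log(n))

n = 10_000
t_quarter = sparsity_threshold(n, 0.25)  # appropriate when n0 = n**(1/4) = 10
t_half = sparsity_threshold(n, 0.5)      # appropriate when n0 = n**(1/2) = 100
print(round(t_quarter, 2), round(t_half, 2))  # 3.72 3.03, as quoted in the Figure 2 caption
```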

Typically we could not know in advance the degree of sparsity of the estimand, so we 
prefer methods adapting automatically to the unknown degree of sparsity. 



1.4 FDR-Controlling Procedures 

Benjamini and Hochberg (1995) proposed a new principle for design of simultaneous testing
procedures - control of the False Discovery Rate (FDR). In a setting where one is testing
many hypotheses, the principle imposes control on the ratio of the number of erroneously
rejected hypotheses to the total number rejected. The exact definition and basic properties
of the FDR, as well as examples of procedures holding it below a specified level $q$, are
reviewed in Section 2. In the context of estimation, a thresholding procedure, which
reflects the step-up FDR controlling procedure in Benjamini and Hochberg (1995), was first
proposed in Abramovich and Benjamini (1996). The procedure is quite simple:



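Form the order statistics of the magnitudes of the observed estimates,

$$|y|_{(1)} \ge |y|_{(2)} \ge \dots \ge |y|_{(k)} \ge \dots \ge |y|_{(n)}, \tag{1.7}$$

and compare them to the series of right tail Gaussian quantiles $t_k = \sigma_n z(q/2 \cdot k/n)$, where
$z(\alpha)$ is the point exceeded with probability $\alpha$ by a standard normal variable. Let $\hat k_F$
be the largest index $k$ for which $|y|_{(k)} \ge t_k$; threshold the estimates at (the data dependent)
threshold $t_{\hat k_F} = \hat t_F$,

$$\hat\mu_{F,k} = \begin{cases} y_k, & |y_k| \ge \hat t_F, \\ 0, & \text{else.} \end{cases} \tag{1.8}$$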
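
The procedure just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names are ours, $z(\alpha)$ is computed as the standard normal quantile at $1 - \alpha$, and the fallback to $t_1$ when no order statistic crosses the boundary is our assumption.

```python
from statistics import NormalDist

def fdr_threshold(y, q, sigma=1.0):
    """Return (k_F, t_F): the step-up FDR index and the data-dependent threshold."""
    n = len(y)
    mags = sorted((abs(v) for v in y), reverse=True)   # |y|_(1) >= ... >= |y|_(n)
    std_normal = NormalDist()
    k_f, t_f = 0, None
    for k, m in enumerate(mags, start=1):
        # right-tail Gaussian quantile t_k = sigma * z(q/2 * k/n)
        t_k = sigma * std_normal.inv_cdf(1.0 - (q / 2.0) * k / n)
        if m >= t_k:
            k_f, t_f = k, t_k                          # keep the largest crossing index
    if k_f == 0:
        # assumption: with no crossing, threshold at t_1, which zeroes every coordinate
        t_f = sigma * std_normal.inv_cdf(1.0 - (q / 2.0) / n)
    return k_f, t_f

def fdr_estimate(y, q, sigma=1.0):
    """Hard-threshold y at the FDR threshold, as in (1.8)."""
    _, t_f = fdr_threshold(y, q, sigma)
    return [v if abs(v) >= t_f else 0.0 for v in y]
```

For instance, `fdr_estimate([5.0, -4.5, 0.1, 0.2], q=0.05)` keeps the two large coordinates and zeroes the rest.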
Figure 2: Gaussian shift model with $n = 10,000$ and $\sigma_n = 1$. There are $n_0 = n^{1/4} = 10$ non-zero
components $\mu_i = \mu_0 = 5.21$. Thus $\beta = 1/4$. Stars show ordered data $|y|_{(k)}$ and solid circles the
corresponding true means. Dotted horizontal line is "correct" threshold $t_{1/4} = \sqrt{2(1 - \frac{1}{4})\log n} =
3.72$, and dotted vertical lines show magnitude of the error committed with $t_{1/4}$. Solid horizontal
line is a 'misspecified' threshold $t_{1/2} = \sqrt{2(1 - \frac{1}{2})\log n} = 3.03$, which would be the appropriate
choice for $n_0 = n^{1/2} = 100$ non-zero components. Solid vertical lines show the additional absolute
error suffered by using this misspecified threshold. Quantitatively, the absolute error $\|\hat\mu - \mu\|_1$ using
the right threshold is 14.4 versus 70.0 for the wrong threshold. For $\ell_2$ error $\|\hat\mu - \mu\|_2^2$, the right
threshold has error 38.8 and the wrong one has error 221.1.



The FDR threshold is inherently adaptive to the sparsity level: it is higher for sparse signals 
and lower for dense ones. In the context of model selection, control of the FDR means 
that when the model is discovered to be complex, so that many variables are needed, we 
should not be concerned unduly about occasional inclusion of unnecessary variables; this 
is bound to happen. Instead, it is preferable to control the proportion of erroneously 
included variables. In a limited simulation study in the context of wavelet estimation,
Abramovich and Benjamini (1996) demonstrated the good adaptivity properties of the FDR
thresholding procedure as reflected in relative mean square error performance.

In order to demonstrate the adaptivity of FDR thresholding, Figure 3 illustrates the results
of FDR thresholding at two different sparsity levels. In the first, sparser, case a higher
threshold is chosen. Furthermore, the fraction of discoveries (coefficients above threshold)
that are false discoveries (coming from coordinates with true mean 0) is roughly similar in
the two cases. This is consistent with the fundamental result of Benjamini and Hochberg
(1995) that the FDR procedure described above controls the false discovery rate below level
$q$ whatever be the configuration of means $\mu \in \mathbb{R}^n$, $n \ge 1$.
Figure 3: (a) 10 out of 10,000. $\mu_i = 5.21$ for $i = 1, \dots, n_0 = 10$ and $\mu_i = 0$ if $i =
11, 12, \dots, n = 10,000$. Data $y_i$ from model (1.1), $\sigma_n = 1$. Solid line: ordered data $|y|_{(k)}$. Solid
circles: true unobserved mean value $\mu_i$ corresponding to observed $|y|_{(k)}$. Dashed line: FDR quantile
boundary $t_k = z(q/2 \cdot k/n)$, $q = 0.05$. Last crossing at $\hat k_F = 12$ producing threshold $\hat t_F = 4.02$.
Thus $|y|_{(10)}$ and $|y|_{(12)}$ are false discoveries out of a total of $\hat k_F = 12$ discoveries. The empirical
false discovery rate FDR = 2/12. (b) 100 out of 10,000. $\mu_i = \mu_0 = 4.52$ for $i = 1, \dots, n_0 = 100$;
otherwise zero. Same FDR quantile boundary, $q = 0.05$. Now there are $\hat k_F = 84$ discoveries, yielding
$\hat t_F = 3.54$ and FDR = 5/84.



1.5 Certainty-Equivalent Heuristics for FDR-based thresholding 

How can FDR multiple-testing ideas be related to the performance of the corresponding 
estimator? Here we sketch a simple heuristic. 

Consider an 'in-mean' analysis of FDR thresholding. In the FDR definition, replace the
observed data $|y|_{(k)}$ by the mean values $\mu_k$, assumed to be already decreasing. Consider a
pseudo-FDR index $k_*(\mu)$, found, assuming $\sigma_n = 1$, by solving for the crossing point

$$\mu_{k_*} = t_{k_*}.$$

Consider the case where the object of interest obeys the weak $\ell_p$ sparsity constraint
$\mu \in m_p[\eta_n]$. Weak $\ell_p$ has a natural 'extreme' sequence, namely (1.5). Consider the 'in-mean'
behavior at this extremal sequence; the crossing point relation yields

$$\eta_n n^{1/p} k_*^{-1/p} = t_{k_*}.$$

Using the relation $t_k \sim \sqrt{2\log(n/k)}$, valid for $k = o(n)$, one sees quickly that

$$t_{k_*} \sim \sqrt{2\log \eta_n^{-p}}, \qquad n \to \infty;$$

the right hand side of this display is asymptotic to the correct minimax threshold for weak
and strong $\ell_p$ balls of radius $\eta_n$!
Thus, FDR, in a heuristic, certainty-equivalent analysis, is able to determine the threshold
appropriate to a given signal sparsity. Further, this calculation makes no reference to
the loss function, and so we might hope that the whole range $0 < r \le 2$ of $\ell_r$ error measures
is covered.
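
The in-mean heuristic can be explored numerically by locating the crossing of the extremal weak-$\ell_p$ sequence with the FDR quantile boundary and comparing the resulting threshold with $\sqrt{2\log\eta_n^{-p}}$. The sketch below is ours, and the values of $n$, $\eta$, $p$ and $q$ are arbitrary illustrative choices. Since the agreement is only asymptotic, a visible gap at moderate $n$ is to be expected.

```python
from math import log, sqrt
from statistics import NormalDist

def in_mean_crossing(n, eta, p, q):
    """Last index k at which the extremal weak-l_p sequence still reaches t_k (sigma_n = 1)."""
    z = NormalDist().inv_cdf
    k_star, t_star = 0, None
    for k in range(1, n + 1):
        mu_bar_k = eta * (n / k) ** (1.0 / p)   # extremal sequence eta * n^(1/p) * k^(-1/p)
        t_k = z(1.0 - (q / 2.0) * k / n)        # FDR quantile boundary z(q/2 * k/n)
        if mu_bar_k >= t_k:
            k_star, t_star = k, t_k
    return k_star, t_star

n, eta, p, q = 10_000, 0.01, 1.0, 0.05          # illustrative values only
k_star, t_star = in_mean_crossing(n, eta, p, q)
t_asymptotic = sqrt(2.0 * log(eta ** (-p)))     # sqrt(2 log eta^(-p))
print(k_star, round(t_star, 2), round(t_asymptotic, 2))
```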

1.6 Main Results 

Given an $\ell_r$ error measure and $\Theta_n \subset \mathbb{R}^n$, the worst-case risk of an estimator $\hat\mu$ over $\Theta_n$ is

$$\rho(\hat\mu, \Theta_n) = \sup_{\mu \in \Theta_n} E_\mu \|\hat\mu - \mu\|_r^r. \tag{1.9}$$

The parameter spaces of interest to us will be those introduced earlier:

i) $\Theta_n = \ell_0[\eta_n]$ ("nearly black"),

ii) $\Theta_n = m_p[\eta_n]$, $0 < p < r$ (weak-$\ell_p$ balls), and

iii) $\Theta_n = \ell_p[\eta_n]$, $0 < p < r$ (strong-$\ell_p$ balls).

In these cases we will need to have $\eta_n \to 0$ with increasing $n$, reflective of increasing sparsity.
For a given $\Theta_n$, the minimax risk is the best attainable worst-case risk:

$$R_n(\Theta_n) = \inf_{\hat\mu} \rho(\hat\mu, \Theta_n); \tag{1.10}$$

the infimum covers all estimators (measurable functions of the data). Any particular estimator,
such as FDR, must have $\rho(\hat\mu_F, \Theta_n) \ge R_n(\Theta_n)$, but we might ask how inefficient $\hat\mu_F$
is relative to the "benchmark" for $\Theta_n$ provided by $R_n(\Theta_n)$.

Theorem 1.1. Let $y \sim N_n(\mu, \sigma_n^2 I)$ and the FDR estimator $\hat\mu_F$ be defined by (1.8). In
applying the FDR estimator, the FDR control parameter ($q_n$, say) may depend on $n$, but
suppose this has a limit $q \in [0, 1)$. In addition, suppose $q_n \ge \gamma/\log(n)$ for some $\gamma > 0$ and
all $n \ge 1$.

Use the $\ell_r$ risk measure (1.9) where $0 \le p < r \le 2$. Let $\Theta_n$ be one of the parameter
spaces detailed above with $\eta_n \in [n^{-1}\log^5 n, n^{-\delta}]$, $\delta > 0$. Then as $n \to \infty$,

$$\sup_{\mu \in \Theta_n} \rho(\hat\mu_F, \mu) = R_n(\Theta_n)\Big\{1 + u_{rp}\,\frac{(2q - 1)_+}{1 - q} + o(1)\Big\},$$

where $u_{rp} = 1$ and $u_{rp} = 1 - (p/r)$ for strong- and weak-$\ell_p$ balls respectively.

Hence, if the FDR control parameter $q \le 1/2$, $\rho(\hat\mu_F, \Theta_n) \sim R_n(\Theta_n)$ in the sense that the
ratio approaches 1 as $n \to \infty$. Otherwise, $\rho(\hat\mu_F, \Theta_n) \sim c(q) R_n(\Theta_n)$ for an explicit $c(q) > 1$
growing with $q$.

In short, Theorem 1.1 establishes the asymptotic minimaxity of the FDR estimator in
the setting of (1.1) - provided we control false discoveries so that there are more true
discoveries than false ones. Moreover, this minimaxity is adaptive across various losses and 
sparse parameter spaces. 

This exhibits a tighter connection between False Discovery Rate ideas and adaptive 
minimaxity than one might have expected. The key parameter in the FDR theory - the 
rate itself - seems to be diagnostic for performance. 
1.7 Interpretations

Two remarks help place the above result in context. 



1.7.1 Comparison with other estimators 

The result may be compared to traditional results in the estimation of the multivariate
normal mean. Summarizing results given in Donoho and Johnstone (1994b):



(i) Linear estimators attain the wrong rates of convergence when $0 < p < r$, over these
parameter spaces;

(ii) The James-Stein estimator, which is essentially a linear estimator with data-determined
shrinkage factor, has the same defect as linear estimators;

(iii) Thresholding at a fixed level, say $\sigma_n \cdot \sqrt{2\log n}$, does attain the right rates, but with
the wrong constants for $0 < p < r$;

(iv) Stein's unbiased risk estimator (SURE) directly optimizes the $\ell_2$ error, and is adaptive
for $r = 2$ and $1 < p < 2$ (Donoho and Johnstone, 1995). However, there appears to
be a major technical (empirical process) barrier to extending this result to $p \le 1$,
and indeed, instability has been observed in such cases in simulation experiments
(Donoho and Johnstone, 1995). Further, there is no reason to expect that optimizing
an $\ell_2$ criterion should also give optimality for $\ell_r$ error measures, $p < r < 2$.

In short, traditional estimators are not able to achieve the desired level of adaptation to 
unknown sparsity. On the other hand, recent work by Johnstone and Silverman (2004), 
triggered by the present paper, exhibits an empirical Bayes estimator - EBayesThresh - 
which seems, in simulations, competitive with FDR thresholding, although the theoretical 
results for sparse cases are currently weaker. 



1.7.2 Validity of Simultaneous Minimaxity

Minimax estimators are often criticised as being complicated, counter-intuitive and distracted
by irrelevant worst cases. An often-cited example is $\hat p = [x + \sqrt{n}/2]/[n + \sqrt{n}]$ for
estimating a success probability $p \in [0,1]$ from $X \sim \mathrm{Bin}(n,p)$. Although this estimator
is minimax for estimating $p$ under squared-error loss, 'everybody' agrees that the common
sense estimator $\bar x = x/n$ is 'obviously better' - better at most $p$ and marginally worse only
at $p$ near 1/2.
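
The comparison between the two binomial estimators can be made concrete with their exact squared-error risks, obtained from the usual variance-plus-squared-bias decomposition. The sketch below is ours (function names are illustrative); it shows that the minimax estimator has constant risk in $p$, while $x/n$ has risk $p(1-p)/n$.

```python
from math import sqrt

def risk_minimax(n, p):
    """Exact squared-error risk of (x + sqrt(n)/2) / (n + sqrt(n)) when x ~ Bin(n, p)."""
    a, b = sqrt(n) / 2.0, sqrt(n)
    variance = n * p * (1.0 - p) / (n + b) ** 2
    bias = (a - b * p) / (n + b)
    return variance + bias ** 2

def risk_sample_mean(n, p):
    """Exact squared-error risk of x / n, which is unbiased with variance p(1-p)/n."""
    return p * (1.0 - p) / n

n = 100
for p in (0.1, 0.3, 0.5):
    print(p, risk_minimax(n, p), risk_sample_mean(n, p))
```

With $n = 100$ the sample mean has the smaller risk at $p = 0.1$ and the larger risk at $p = 0.5$, in line with the remark above.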

Perhaps surprisingly, simultaneous (asymptotic) minimaxity seems to avoid such objec- 
tions. Instead, to paraphrase an old dictum, it shifts the focus from an "exact solution 
to the wrong problem" to "an approximate solution to the right problem". To explain 
this, note that to develop a standard minimax solution, one starts with parameter space
$\Theta$ and error measure $\|\cdot\|$ and finds a minimax estimator $\hat\mu_{\Theta, \|\cdot\|}$ attaining the minimum in
(1.10). This estimator may indeed be unsatisfactory in practice, for example because it
may depend on aspects of $\Theta$ that will not be known, or may be incorrectly specified.

In contrast, we begin here with an a priori reasonable estimator $\hat\mu_F$ whose definition does
not depend on the imposed $\|\cdot\|$ and the presumably unknown $\Theta_n$. Adaptive minimaxity
for $q \le 1/2$ - as established for $\hat\mu_F$ in Theorem 1.1 - shows that, for a large class of relevant
parameter spaces $\Theta_n$ and error measures $\|\cdot\|$, $\rho(\hat\mu_F, \Theta_n) \sim R_n(\Theta_n)$. In other words, the
prespecified estimator $\hat\mu_F$ is flexible enough to be approximately an optimal solution in
many situations of very different type (varying sparsity degree $p$, sparsity control $\eta_n$ and
error measure $r$ in the FDR example).

Using large n asymptotics to exhibit approximately minimax solutions for finite n also 
renders the theory more flexible. For example in the binomial setting cited earlier, the 
standard estimator $\bar x = x/n$, while not exactly minimax for finite $n$, is asymptotically
minimax. More: if we consider in the binomial setting the parameter spaces
$\Theta[a,b] = \{p : a \le p \le b\}$, then $\bar x$ is simultaneously asymptotically minimax for a very wide range of
parameter spaces - each $\Theta[a,b]$ for $0 < a < b < 1$ - whereas $\hat p$ is asymptotically minimax
only for special cases $a < 1/2 < b$. In short, whereas minimaxity violates common sense in
the binomial case, simultaneous asymptotic minimaxity agrees with it perfectly. 

1.8 Penalized Estimators 

At the center of our paper is the study, not of $\hat\mu_F$, but of a family of complexity-penalized
estimators. These yield approximations to FDR-controlling procedures, but seem far more 
amenable to direct mathematical analysis. Our study also allows us to exhibit connections 
of FDR control to several other recently proposed model selection methods. 
A penalized estimator is a minimizer of $\mu \mapsto K(\mu, y)$, where

$$K(\mu, y) = \|y - \mu\|_2^2 + \mathrm{Pen}(\mu). \tag{1.11}$$

If the penalty term $\mathrm{Pen}(\mu)$ takes an $\ell_p$ form, $\mathrm{Pen}(\mu) = \lambda\|\mu\|_p^p$, familiar estimators result:
$p = 2$ gives linear shrinkage $\hat\mu_i = (1 + \lambda)^{-1} y_i$; while $p = 1$ yields soft thresholding
$\hat\mu_i = (\mathrm{sgn}\, y_i)(|y_i| - \lambda/2)_+$; for $p = 0$, $\mathrm{Pen}(\mu) = \lambda^2\|\mu\|_0$ gives hard thresholding
$\hat\mu_i = y_i 1\{|y_i| > \lambda\}$. Penalized FDR results from modifying the penalty to

$$\mathrm{Pen}(\mu) = \sum_{l=1}^{\|\mu\|_0} t_l^2.$$

Denote the resulting minimizer of (1.11) by $\hat\mu_2$. For small $\|\mu\|_0$, $\mathrm{Pen}(\mu) \sim t^2_{\|\mu\|_0} \cdot \|\mu\|_0$. The
penalty therefore has the flavor of an $\ell_0$-penalty, but with the regularization parameter $\lambda^2$ replaced
by the squared Gaussian quantile appropriate to the complexity $\|\mu\|_0$ of $\mu$. Further, $\hat\mu_2$ is
indeed a variable hard threshold rule. If $\hat k_2$ is a minimizer of

$$S_k = \sum_{l=k+1}^n |y|_{(l)}^2 + \sum_{l=1}^k t_l^2,$$

then $\hat\mu_{2,i} = y_i 1\{|y_i| \ge t_{\hat k_2}\}$.

The connection with original FDR arises as follows: $\hat k_2$ is the location of the global
minimum of $S_k$, while the FDR index $\hat k_F$ is the rightmost local minimum. Similarly, we
define $\hat k_G$ as the leftmost local minimum of $S_k$: evidently $\hat k_G \le \hat k_2 \le \hat k_F$. For future
reference, we will call $\hat k_G$ the Step-Down FDR index. In practice, these indices are often
identical. For theoretical purposes, we show (Proposition 5.1 and Theorem 9.3) that $\hat k_F - \hat k_G$
is uniformly small enough on our sparse parameter spaces $\Theta_n$ that asymptotic minimaxity
conclusions for $\hat\mu_2$ can be carried over to $\hat\mu_F$.
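
The three indices can be computed side by side from the increments $S_k - S_{k-1} = t_k^2 - |y|_{(k)}^2$: the criterion decreases at step $k$ exactly when $|y|_{(k)}$ exceeds $t_k$. The sketch below is ours, not the authors' implementation; it assumes no exact ties between $|y|_{(k)}$ and $t_k$, takes the global minimum over $k = 0, \dots, n$, and uses illustrative names throughout.

```python
from statistics import NormalDist

def fdr_indices(y, q, sigma=1.0):
    """Return (k_G, k_2, k_F): step-down, penalized and step-up FDR indices."""
    n = len(y)
    mags = sorted((abs(v) for v in y), reverse=True)
    z = NormalDist().inv_cdf
    t = [sigma * z(1.0 - (q / 2.0) * k / n) for k in range(1, n + 1)]

    # s[k] = sum_{l > k} |y|_(l)^2 + sum_{l <= k} t_l^2, for k = 0, ..., n
    s = [sum(m * m for m in mags)]
    for k in range(1, n + 1):
        s.append(s[-1] - mags[k - 1] ** 2 + t[k - 1] ** 2)

    k_2 = min(range(n + 1), key=lambda k: s[k])   # global minimum of S
    # rightmost local minimum: largest k with |y|_(k) >= t_k
    k_f = max((k for k in range(1, n + 1) if mags[k - 1] >= t[k - 1]), default=0)
    # leftmost local minimum: stop at the first k with |y|_(k) < t_k
    k_g = 0
    while k_g < n and mags[k_g] >= t[k_g]:
        k_g += 1
    return k_g, k_2, k_f

y = [5.0, -2.5, 2.5, 2.5, 0.3, -0.2, 0.1, 0.0, -0.1, 0.2]
print(fdr_indices(y, q=0.05))
```

In this small example the step-down index stops at the first failure, while the step-up and penalized indices both look past it.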

To extend this story from $\ell_2$ to $\ell_r$ losses, we make a straightforward translation:

$$\hat\mu_r = \arg\min_\mu \|y - \mu\|_r^r + \sum_{l=1}^{\|\mu\|_0} t_l^r. \tag{1.12}$$

Again it follows that $\hat k_r \in [\hat k_G, \hat k_F]$. Our strategy is, first, to prove $\ell_r$-loss optimality results
using $\hat\mu_r$, and later, to draw parallel conclusions for the original FDR rule $\hat\mu_F$.

Why is the penalized form helpful? In tandem with the definition of $\hat\mu_r$ as the minimizer
of an empirical complexity $\tilde\mu \mapsto K(\tilde\mu, y)$, we can define the minimizer $\mu_0$ of the theoretical
complexity $\tilde\mu \mapsto K(\tilde\mu, \mu)$ obtained by replacing $y$ by its expected value $\mu$. By the very
definition of $\hat\mu_r$, we have $K(\hat\mu_r, y) \le K(\mu_0, y)$, and by simple manipulations one arrives (in
the $\ell_2$ case here) at the basic bound, valid for all $\mu \in \mathbb{R}^n$:

$$E\|\hat\mu_2 - \mu\|^2 \le K(\mu_0, \mu) + 2E\langle \hat\mu_2 - \mu, z\rangle - E_\mu \mathrm{Pen}(\hat\mu_2). \tag{1.13}$$
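
The 'simple manipulations' behind (1.13) can be spelled out as follows. This is a sketch under the assumption $\sigma_n = 1$, so that $y = \mu + z$ with $z \sim N_n(0, I)$; it uses only the minimizing property of $\hat\mu_2$ and the fact that $\mu_0$ is non-random. Expanding both squared distances around $\mu$ gives

$$
\begin{aligned}
\|y - \hat\mu_2\|_2^2 &= \|\hat\mu_2 - \mu\|_2^2 - 2\langle \hat\mu_2 - \mu, z\rangle + \|z\|_2^2, \\
\|y - \mu_0\|_2^2 &= \|\mu_0 - \mu\|_2^2 - 2\langle \mu_0 - \mu, z\rangle + \|z\|_2^2 .
\end{aligned}
$$

Substituting into $K(\hat\mu_2, y) \le K(\mu_0, y)$ and cancelling $\|z\|_2^2$ yields

$$
\|\hat\mu_2 - \mu\|_2^2 \le \underbrace{\|\mu_0 - \mu\|_2^2 + \mathrm{Pen}(\mu_0)}_{K(\mu_0, \mu)} + 2\langle \hat\mu_2 - \mu, z\rangle - 2\langle \mu_0 - \mu, z\rangle - \mathrm{Pen}(\hat\mu_2).
$$

Taking expectations, the term $E\langle \mu_0 - \mu, z\rangle$ vanishes because $\mu_0 - \mu$ is deterministic and $Ez = 0$, which leaves the three terms on the right side of (1.13).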

Analysis of the individual terms on the right side is very revealing. Consider the theoretical
complexity term $K(\mu_0, \mu)$. For $\Theta_n$ of type (i)-(iii) in the previous section, it turns
out that the worst-case theoretical complexity is asymptotic to the minimax risk! Thus:

$$\sup_{\mu \in \Theta_n} K(\mu_0, \mu) \sim R_n(\Theta_n), \qquad n \to \infty. \tag{1.14}$$

The argument for this relation is rather easy, and will be given below in Section 9.2. The
remaining term $2E\langle \hat\mu_2 - \mu, z\rangle - E_\mu \mathrm{Pen}(\hat\mu_2)$ in (1.13) has the flavor of an error term of
lower order. Detailed analysis is actually rather hard work, however. Section 9.3 overviews
a lengthy argument, carried out in the immediately following sections, showing that this
error term is indeed negligible over $\Theta_n$ if $q \le 1/2$, and of the order of $R_n(\Theta_n)$ otherwise.

Plausibility for simultaneous asymptotic minimaxity of FDR is thus laid out for us very
directly within the penalized FDR point of view. A full justification requires study of the 
theoretical complexity and the error term respectively. This fact permeates the architecture 
of the arguments to follow. 



1.9 Penalization by $2k\log(n/k)$

Penalization connects our work with a vast literature on model selection. Dating back to 
Akaike (1973), it has been popular to consider model selection rules of the form 

$$\hat k = \arg\min_k RSS(k) + 2\sigma^2 \cdot k \cdot \lambda,$$

where $\lambda$ is the penalization parameter and $RSS(k)$ stands for "the best residual sum of
squares $\|y - m\|_2^2$ for a model $m$ with $k$ parameters". The AIC model selection rule takes
$\lambda = 1$. Schwarz' BIC model selection rule takes $\lambda = \log(n)/2$, where $n$ is the sample size.
Foster and George's RIC model selection rule takes $\lambda = \log(p)$, where $p$ is the number of
variables available for potential inclusion in the model.

Several independent groups of researchers have recently proposed model selection rules 
with variable penalty factors. For convenience, we can refer to these as $2\log(n/k)$ factors,
yielding rules of the form

$$\hat k = \arg\min_k RSS(k) + 2\sigma^2 \cdot k \cdot \log(n/k). \tag{1.15}$$



• Foster and Stine (1999) arrived at a penalty $\sigma^2 \sum_{j=1}^k 2\log(n/j)$ from information-theoretic
considerations. Along sequences of $k$ and $n$ with $n \to \infty$ and $k/n \to 0$,
$2k\log(n/k) \sim \sum_{j=1}^k 2\log(n/j)$.

• For prediction problems, Tibshirani and Knight (1999) propose model selection using
a covariance inflation criterion which adjusts the training error by the average covariance
of predictions and responses on permuted versions of the dataset. In the case of
orthogonal regression, their proposal takes the form of complexity-penalized residual
sum of squares, with the complexity penalty approximately of the above form, but
larger by a factor of 2: $2\sigma^2 \sum_{j=1}^k 2\log(n/j)$. There are intriguing parallels between
the covariance expression for the optimism (Efron, 1986) in Tibshirani and Knight
(1999, formula (6)) and the complexity bound (1.13).

• George and Foster (2000) adopt an empirical Bayes approach, drawing the components
$\mu_i$ independently from a mixture prior $(1 - w)\delta_0 + wN(0, C)$ and then estimating
the hyperparameters $(w, C)$ from the data $y$. They argue that the resulting estimator
penalizes the addition of a $k$th variable by a quantity close to $2\log(\frac{n+1}{k} - 1)$.

• Birgé and Massart (2001) studied complexity-penalized model selection for a class
of penalty functions, including penalties of the form $2\sigma^2 k\log(n/k)$. They develop
non-asymptotic risk bounds for such procedures over $\ell_p$ balls.
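
The asymptotic equivalence between $2k\log(n/k)$ and $\sum_{j=1}^k 2\log(n/j)$ noted in the first item above can be checked numerically. The following is a small sketch of ours (function names are illustrative, and the $\sigma^2$ factor is omitted); it uses the identity $\sum_{j=1}^k \log(n/j) = k\log n - \log k!$ and shows that the ratio approaches 1 only slowly as $n$ grows with $k$ fixed.

```python
from math import lgamma, log

def flat_penalty(n, k):
    """2 k log(n / k), the penalty in (1.15) without the sigma^2 factor."""
    return 2.0 * k * log(n / k)

def summed_penalty(n, k):
    """sum_{j=1}^k 2 log(n / j), computed as 2 (k log n - log k!)."""
    return 2.0 * (k * log(n) - lgamma(k + 1))

k = 100
for n in (10**4, 10**6, 10**9):
    print(n, summed_penalty(n, k) / flat_penalty(n, k))
```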

Evidently, there is substantial interest in the use of variable-complexity penalties. There
is also an extensive similarity of $2k\log(n/k)$ penalties to FDR penalization. Penalized FDR
$\hat\mu_2$ from (1.12) can be written in penalized form with a variable-penalty factor $\lambda_{k,n}$:

$$\hat k_2 = \arg\min_k RSS(k) + 2\sigma^2 k \lambda_{k,n},$$

where

$$\lambda_{k,n} = \frac{1}{2k}\sum_{l=1}^k z^2\Big(\frac{lq}{2n}\Big) \sim \frac{1}{2} z^2\Big(\frac{kq}{2n}\Big) \sim \log(n/k) - \frac{1}{2}\log\log(n/k) + c(q, k, n)$$

for large $n$, $k = o(n)$, and bounded remainder $c$ (compare (12.7) below). FDR penalization
is thus slightly weaker than $2k\log(n/k)$ penalization. We could also say that $2k\log(n/k)$
penalties have a formal algebraic similarity to FDR penalties, but require a variable
$q = q(k, n)$ that is both small and decreasing with $n$. This perspective on $2k\log(n/k)$ penalties
suggests the following conjecture:

Conjecture 1.2. In the setting of this paper, where 'model selection' means adaptive selection
of nonzero means, and the underlying estimand $\mu$ belongs to one of the parameter spaces
as detailed in Theorem 1.1, the procedure (1.15) is asymptotically minimax, simultaneously
over the full range of parameter spaces and losses covered in that theorem.

In short, although the $2k \cdot \log(n/k)$ rules were not proposed from a formal decision-theoretic
perspective, they might well exhibit simultaneous asymptotic minimaxity. We suspect that
the methods developed in this paper may be extended to yield a proof of this conjecture. 



1.10 Take-Away Messages

The theoretical results in this paper suggest the following two messages: 

TAM 1. FDR-based thresholding gives an optimal way of adapting to unknown sparsity: 
choose q < 1/2. In words, aiming for fewer false discoveries than true ones yields 
sharp asymptotic minimaxity. 

TAM 2. Recently proposed $2k\log(n/k)$ penalization schemes, when used in a sparse
setting, may be viewed as similar to FDR-based thresholding.








  n        q       step-up FDR   penalized FDR   step-down FDR
  1024     0.01    1.3440        1.3440          1.3440
           0.05    1.3283        1.3293          1.3334
           0.25    1.2473        1.2482          1.2512
           0.40    1.2171        1.2171          1.2173
           0.50    1.2339        1.2335          1.2321
           0.75    1.4159        1.4132          1.4100
           0.99    1.9810        1.9744          1.9687
  65536    0.01    1.3370        1.3372          1.3374
           0.05    1.3178        1.3180          1.3183
           0.25    1.2276        1.2277          1.2277
           0.40    1.1889        1.1889          1.1890
           0.50    1.1937        1.1936          1.1936
           0.75    1.5122        1.5118          1.5114
           0.99    4.0211        4.0189          4.0174

Table 1: Ratios of MSE(FDR)/MSE(t*(p, n)), p = 3/2.



1.11 Simulations 

We tested FDR thresholding and related procedures in simulation experiments. The out- 
comes support TAMs 1 and 2. 

Table 1 displays results from simulations at the so-called least-favorable case $\mu_k =
\min\{n^{1/2}k^{-1/p}, \sqrt{(2-p)\log n}\}$ for the weak-$\ell_p$ parameter ball (compare the remark following
(9.13)). Here $p = 1.5$, $r = 2$, $n = 1024$ and $n = 65536$, $\sigma = 1$. The table records the ratio
of the squared-error risk of FDR to the squared-error risk of the asymptotically optimal threshold
$t^* = t^*(p,n) = \sqrt{(2-p)\log n}$ for that parameter ball (compare Section 3.3 below). All
results derive from 100 repeated experiments. The standard errors of the MSEs were
between 0.001 and 0.003 for $n = 1024$ and between 0.0005 and 0.0007 for $n = 65536$.

These results should be compared with the behavior of $2\log(n/k)$-style penalties. For
the estimator of Foster and Stine (1999), minimizing $RSS + \sigma^2\sum_{j=1}^{k} 2\log(n/j)$, we have
that for $n = 1024$, $MSE/MSE(t^*) = 1.2308$, while for $n = 65536$, $MSE/MSE(t^*) =$
1.2281. This is consistent with the behavior that would result from FDR control with $q = 0.3$
for $n = 1024$ and $q = 0.25$ for $n = 65536$.



In Figure 01 we display simulation results under a range of sample sizes. Apparently 
the minimum MSE occurs somewhere below q = 1/2. 

We propose the following interpretations: 

INT 1. FDR procedures with q < 1/2 have a risk which is a reasonable multiple of
the 'ideal risk' based on the threshold which would have been optimal for the given
sparsity of the object. The ratios in Table 1 do not differ much across the various q < 1/2,
which demonstrates the robustness of the FDR procedures to the choice of q.

INT 2. An FDR procedure with q near 1/2 appears to outperform small-q procedures at
this configuration, achieving risks which are roughly comparable to the ideal risk.

INT 3. Avoid FDR procedures with large q, in favor of q < 1/2.



1.12 Contents 

The paper to follow is far more technical than the introduction; in our view necessarily so,
since much of the work concerns refined properties of fluctuations in the extreme upper
tails of the normal order statistics. However, Sections 2-4 should be accessible on a first
reading. They review pertinent information about FDR-controlling procedures and about
minimax estimation over $\ell_p$ balls, and parse our main result into an Upper Bound result and a
Lower Bound result. Section 4 then gives an overview of the paper to follow, which carries
out rigorous proofs of the Lower Bound (Sections 5-8, 13) and the Upper Bound (Sections
9-11).



2 The False Discovery Rate 

The field of Multiple Comparisons has developed many techniques to control the increased
rate of type I error when testing a family of $n$ hypotheses $H_{0i}$ versus $H_{1i}$, $i = 1, 2, \dots, n$.
The traditional approach is to control the familywise error rate at some level $\alpha$, that is,
to use a testing procedure that controls at level $\alpha$ the probability of erroneously rejecting
even one true $H_{0i}$. The venerable Bonferroni procedure ensures this by testing each
hypothesis at the $\alpha/n$ level.

The Bonferroni procedure is criticised as being too conservative, since it often lacks
power to detect the alternative hypotheses. Much research has been devoted to devising
more powerful procedures: tightening the probability inequalities, and incorporating the
dependency structure when it is known. For surveys, see Hochberg and Tamhane (1987)
and Shaffer (1995). In one fundamental sense the success has been limited. Generally the
power deteriorates substantially when the problem is large. As a result, many practitioners
avoid altogether using any multiplicity adjustment to control for the increased type I errors
caused by simultaneous inference.

Benjamini and Hochberg (1995) argued that the control of the familywise error-rate
is a very conservative goal which is not always necessary. They proposed to control the
expected ratio of the number of erroneously rejected hypotheses to the number rejected:
the False Discovery Rate (FDR). Formally, for any fixed configuration of true and false
hypotheses, let $V$ be the number of true null hypotheses erroneously rejected, among the
$R$ rejected hypotheses. Let $Q$ be $V/R$ if $R > 0$, and 0 if $R = 0$; set $FDR = E\{Q\}$, where
the expectation is taken according to the same configuration. The FDR is equivalent to the
familywise error-rate when all tested hypotheses are true, so an FDR-controlling procedure
at level $q$ controls the probability of making even one erroneous discovery in such a situation.
Thus for many problems the value of $q$ is naturally chosen at the conventional levels for
tests of significance. The FDR of a multiple-testing procedure is never larger than the
familywise error-rate. Hence controlling FDR admits more powerful procedures.

Here is a simple step-up FDR-controlling procedure. Let the individual P-values for
the hypotheses $H_{0i}$ be arranged in ascending order: $P_{[1]} \le \dots \le P_{[n]}$. Compare the ordered
P-values to a linear boundary $(i/n)q$, and note the last crossing time:

$\hat k_F = \max\{k : P_{[k]} \le (k/n)q\}.$   (2.1)

The FDR multiple-testing procedure is to reject all hypotheses $H_{0(i)}$ corresponding to the
indices $i = 1, \dots, \hat k_F$. If $\hat k_B$ denotes the number of P-values below the Bonferroni cutoff
$q/n$, it is apparent that $\hat k_F \ge \hat k_B$ and hence that the FDR test conducted at the same level
is necessarily less conservative.
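The step-up rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `step_up_fdr` and `bonferroni_count` are ours, and we assume that a P-value lying exactly on the boundary counts as a crossing.

```python
from typing import List, Sequence


def step_up_fdr(p_values: Sequence[float], q: float) -> List[int]:
    """Return the indices rejected by the step-up rule at level q."""
    n = len(p_values)
    # order[k - 1] is the position of the k-th smallest P-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    k_hat = 0
    for k, idx in enumerate(order, start=1):
        # Record the last k at which P_[k] is at or below the line (k / n) q.
        if p_values[idx] <= (k / n) * q:
            k_hat = k
    # Reject the k_hat smallest P-values, reported in original index order.
    return sorted(order[:k_hat])


def bonferroni_count(p_values: Sequence[float], q: float) -> int:
    """Count the P-values at or below the Bonferroni cutoff q / n."""
    n = len(p_values)
    return sum(1 for p in p_values if p <= q / n)


p = [0.02, 0.025, 0.029, 0.9, 0.95]
print(step_up_fdr(p, q=0.05), bonferroni_count(p, q=0.05))
```

In this example the two smallest P-values lie above the boundary at their own ranks, yet they are still rejected because the third one crosses it; the Bonferroni cutoff rejects nothing.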

Benjamini and Hochberg (1995) considered the above testing procedure in the situation
of independent hypothesis tests on many individual means. They considered the two-sided
P-values from testing that each individual mean was zero. They found that the false
discovery rate of the above multiple testing procedure is bounded by $q$ whatever be the
number of true null hypotheses $n_0$ or the configuration of the means under the alternatives:



$FDR = E_\mu\{Q\} = q\,n_0/n \le q$   for all $\mu \in \mathbb{R}^n$.   (2.2)

The multiple-testing procedure (2.1) was proposed informally by Eklund, by Seeger
(1968) and much later independently by Simes (1986). At the time it was neglected because
it was shown not to control the familywise error-rate [Seeger (1968), Hommel (1988)]. In
the absence of the FDR concept, it was not understood why this procedure could be a
good idea. After introduction of the FDR concept, it was recognized that $\hat k_F$ had the
FDR property, but also that other procedures offered FDR control, most importantly for
us the step-down estimator $\hat k_G$; see Sarkar (2002). This rule, introduced in Section 1, will
also be referred to frequently below, and our theorems are also applicable to thresholding
estimators based upon it.

As noted in the introduction, Abramovich and Benjamini (1995) adapted FDR testing
to the setting of estimation, in particular of wavelet coefficients of an unknown regression
function. In this setting, given $n$ data on an unknown function observed in Gaussian white
noise, we are testing $n$ independent null hypotheses on the function's wavelet coefficients, $\mu_i = 0$.
Using the above formulation with two-sided P-values, we obtain (1.8).



Previously, in the same setting of wavelet estimation, Donoho and Johnstone (1994)
had proposed to estimate wavelet coefficients by setting to zero all coefficients below
a certain "universal threshold" $\sigma_n\sqrt{2\log n}$. A key observation in Donoho (1995) and
Donoho et al. (1995) about this threshold is that, with high probability, every truly zero
wavelet coefficient is estimated by zero.

Using ideas from simultaneous inference we can look at universal thresholding differently.
The likelihood ratio test of the null hypothesis $H_{0i}: \mu_i = 0$ rejects if and only if $|y_i| > t\sigma$,
and the Bonferroni method at familywise level $\alpha$ sets the cutoff for rejection $t$ at $t_{BON} =
\sigma z(\alpha/2n)$. Now very roughly, $z(1/n) \sim \sqrt{2\log n}$; much more precise results are derived
below and lie at the center of our arguments. Hence, Bonferroni at any reasonable level $\alpha$
leads us to set a threshold not far from the universal threshold. Put another way, universal
thresholding may be viewed as precisely a Bonferroni procedure, for $\alpha = \alpha_n'$. We can derive

$\alpha_n' \sim 1/\sqrt{\pi\log n}$ as $n \to \infty$.

As was emphasised by Abramovich and Benjamini (1996), the FDR estimator can choose
lower thresholds than $\sigma_n\sqrt{2\log n}$ when $\hat k_F$ is relatively large. It thus offers the possibility
of adapting to the unknown mean vector by adapting to the data, choosing less conservative
thresholds when significant signal is present. It is this possibility we explore here.
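The closeness of Bonferroni and universal thresholds can be checked numerically. The sketch below uses only the Python standard library; the function names are ours, and `implied_level` is a hypothetical helper that computes the familywise level at which the two-sided Bonferroni cutoff coincides with $\sqrt{2\log n}$ (with unit noise level), for comparison against $1/\sqrt{\pi\log n}$.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_quantile(p: float) -> float:
    """z(p): the level exceeded with probability p by a standard normal."""
    return -_STD_NORMAL.inv_cdf(p)


def bonferroni_threshold(n: int, alpha: float, sigma: float = 1.0) -> float:
    """Two-sided Bonferroni cutoff sigma * z(alpha / 2n)."""
    return sigma * upper_quantile(alpha / (2 * n))


def universal_threshold(n: int, sigma: float = 1.0) -> float:
    """Universal threshold sigma * sqrt(2 log n)."""
    return sigma * math.sqrt(2.0 * math.log(n))


def implied_level(n: int) -> float:
    """Familywise level at which Bonferroni reproduces the universal threshold."""
    t = math.sqrt(2.0 * math.log(n))
    upper_tail = 0.5 * math.erfc(t / math.sqrt(2.0))  # P(Z > t)
    return 2 * n * upper_tail


for n in (1024, 65536):
    print(n, universal_threshold(n), bonferroni_threshold(n, 0.05),
          implied_level(n), 1.0 / math.sqrt(math.pi * math.log(n)))
```

At these sample sizes the implied level is a few percent below $1/\sqrt{\pi\log n}$, as one would expect from the next term of the Gaussian tail expansion.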

Pointers to the FDR literature more generally. The above discussion of FDR threshold- 
ing emphasizes just that 'slice' of the FDR literature needed for this paper, so it is highly 
selective. The literature of FDR methodology is growing rapidly, and is too diverse to 
adequately summarize here. Recent papers have illuminated the FDR from different points 
of view: asymptotic, Bayesian, Empirical Bayes, and as the limit of empirical processes 
(Efron et al. (2001), Storey (2002), Genovese and Wasserman (2002)).

Another line of work, starting with Benjamini and Hochberg (2000), addresses the factor
$n_0/n$ in (2.2): many methods have been offered to estimate this, followed by the step-up
procedure with the adjusted (larger) $q$. The results of Benjamini et al. (2001) and Storey
et al. (2004) assure FDR control under independence. When $n_0/n$ is close to 1, as is our
case in this paper, such methods are close to the original step-up procedure.

An immediate next step beyond this paper would be to study dependent situations. 
The FDR-controlling property of the step-up procedure under positive dependence has 
been established in Benjamini and Yekutieli (2001), and similar results were derived for 
the step-down version in Sarkar (2002). Since much of the formal structure below is based 
on marginal properties of the observations, this raises the possibility that our estimation 
results would extend to a broader class of situations involving dependence in the noise terms
$z_i$.



3 Minimax estimation on $\ell_0$, $\ell_p$, $m_p$



As a prelude to the formulation of the adaptive minimaxity results, we review information
(Donoho et al. (1992); Donoho and Johnstone (1994); Johnstone (1994)) on minimax estimation
over $\ell_0$, $\ell_p$ and weak $\ell_p$ balls in the sparse case: $0 < p < 2$ and with normalized radius
$\eta_n \to 0$ as $n \to \infty$. Throughout this section, we suppose a shift Gaussian model (1.1) with
unit noise level $\sigma^2 = 1$. We will denote the risk of an estimator $\hat\mu$ under $\ell_r$ loss by

$\rho(\hat\mu, \mu) = E_\mu\|\hat\mu - \mu\|_r^r.$

Particularly important classes of estimators are obtained by thresholding of individual
co-ordinates: hard thresholding was defined at (1.2), while soft thresholding of a single
co-ordinate $y_i$ is given by $\eta_S(y_i, t) = \mathrm{sgn}(y_i)(|y_i| - t)_+$. We use a special notation for the
risk function of thresholding on a single scalar observation $y_i \sim N(\mu_i, 1)$:

$\rho_S(t, \mu_i) = E_{\mu_i}|\eta_S(y_i, t) - \mu_i|^r,$

with an analogous definition of $\rho_H(t, \mu_i)$ for hard thresholding.
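As an illustration, the two thresholding rules and their scalar risk functions can be sketched as follows. The names are ours; the hard-thresholding convention (keep $y$ when $|y| > t$) is an assumption, since the defining display (1.2) is not reproduced in this section; and the risk is approximated by a plain midpoint rule rather than any method used in the paper.

```python
import math
from typing import Callable


def soft_threshold(y: float, t: float) -> float:
    """sgn(y) * (|y| - t)_+ : shrink towards zero by t."""
    shrunk = abs(y) - t
    return math.copysign(shrunk, y) if shrunk > 0 else 0.0


def hard_threshold(y: float, t: float) -> float:
    """Keep y when |y| > t (assumed convention), otherwise return zero."""
    return y if abs(y) > t else 0.0


def thresholding_risk(rule: Callable[[float, float], float], t: float, mu: float,
                      r: float = 2.0, half_width: float = 10.0,
                      steps: int = 20000) -> float:
    """Approximate E|rule(y, t) - mu|^r for y ~ N(mu, 1) by a midpoint rule."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps):
        z = -half_width + (j + 0.5) * h          # value of the noise y - mu
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += abs(rule(mu + z, t) - mu) ** r * density * h
    return total


print(thresholding_risk(soft_threshold, 1.0, 0.0))
print(thresholding_risk(hard_threshold, 1.0, 0.0))
```

Passing the rule as a function keeps one integration routine for both $\rho_S$ and $\rho_H$.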



3.1 $\ell_0$ balls

Asymptotically least-favorable configurations for $\ell_0$ balls $\ell_0[\eta_n]$ can be built by drawing the
$\mu_i$ i.i.d. from sparse two-point prior distributions

$\pi = (1 - \eta_n)\delta_0 + \eta_n\delta_{\mu_n}, \qquad \mu_n \sim (2\log\eta_n^{-1})^{1/2}.$

The precise definition of $\mu_n$ is given in the Remark below. The expected number of non-zero
components $\mu_i$ is $k_n = n\eta_n$. The prior is constructed so that the corresponding Bayes
estimator essentially estimates zero even for those $\mu_i$ drawn from the atom at $\mu_n$, and so the
Bayes estimator has an $\ell_r$ risk of at least $k_n\mu_n^r$. A corresponding asymptotically minimax
estimator is given by soft or hard thresholding at threshold $\tau_\eta = \tau(\eta_n) := (2\log\eta_n^{-1})^{1/2} \sim
\mu_n$ as $n \to \infty$. This estimator achieves the precise asymptotics of the minimax risk, namely:

$R_n(\ell_0[\eta_n]) \sim k_n\mu_n^r = n\eta_n\mu_n^r \sim n\eta_n(2\log\eta_n^{-1})^{r/2}.$   (3.1)

Remark. Given a sequence $a_n$ with $a_n^2 = o(\log\eta_n^{-1})$ that increases slowly to $\infty$, $\mu_n$ is defined as
the solution of the equation $\phi(a_n + \mu_n) = \eta_n\phi(a_n)$, where $\phi$ denotes the standard Gaussian
density. Equivalently, $\mu_n^2 + 2a_n\mu_n = 2\log\eta_n^{-1} = \tau_\eta^2$, giving the more precise relation

$\tau_\eta = \mu_n + a_n + o(a_n).$   (3.2)

Thus $\tau_\eta - \mu_n \to \infty$, which for both soft and hard thresholding at $\tau_\eta$ indicates $\rho(\tau_\eta, \mu_n) \sim \mu_n^r$.
[Note also that, to simplify notation, we are using $\mu_n$ to denote a sequence of constants
rather than the $n$-th component of the vector $\mu$.]
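A small numerical sketch may help fix ideas. It assumes the defining relation is $\phi(a_n + \mu_n) = \eta_n\phi(a_n)$, which is equivalent to the quadratic $\mu_n^2 + 2a_n\mu_n = 2\log\eta_n^{-1}$ and hence has the explicit positive root used below. The function names and the sample values of $\eta_n$ and $a_n$ are ours.

```python
import math


def gaussian_density(x: float) -> float:
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def critical_threshold(eta: float) -> float:
    """tau = sqrt(2 log(1 / eta)) for a sparsity level 0 < eta < 1."""
    return math.sqrt(2.0 * math.log(1.0 / eta))


def atom_location(eta: float, a: float) -> float:
    """Positive root of mu^2 + 2 a mu = 2 log(1 / eta)."""
    tau_squared = 2.0 * math.log(1.0 / eta)
    return -a + math.sqrt(a * a + tau_squared)


eta, a = 1e-4, 1.5
mu = atom_location(eta, a)
# The gap tau - mu is positive and slightly smaller than a.
print(mu, critical_threshold(eta) - mu)
```

The gap $\tau_\eta - \mu_n$ equals $a_n$ minus a correction of order $a_n^2/\tau_\eta$, which is how the $o(a_n)$ term in (3.2) arises.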



3.2 $\ell_p$ balls

Again, asymptotically least favorable configurations for $\ell_p[\eta_n]$ are obtained by i.i.d. draws
from $\pi = (1 - \beta_n)\delta_0 + \beta_n\delta_{\mu_n}$, where now the mass of the non-zero atom and its location
are, informally, given by the pair of properties

$\beta_n = \eta_n^p\mu_n^{-p}, \qquad \mu_n \sim (2\log\beta_n^{-1})^{1/2}.$   (3.3)

More precisely, $\mu_n = \mu_n(\eta_n, a_n; p)$ is now the solution of $\phi(a_n + \mu_n) = \beta_n\phi(a_n)$, which
implies that

$\mu_n \sim \tau_\eta = (2\log\eta_n^{-p})^{1/2}, \qquad n \to \infty,$   (3.4)

and then that (3.2) continues to hold for $\ell_p$ balls. The expected number of non-zero
components $\mu_i$ is now $n\beta_n = n\eta_n^p\mu_n^{-p}$. For later use, we define

$k_n = n\eta_n^p\tau_\eta^{-p};$   (3.5)

since $\mu_n \sim \tau_\eta$, we have $k_n \sim n\beta_n$, and so $k_n$ is effectively the non-zero number. With
similar heuristics for the Bayes estimator, the exact asymptotics of minimax risk becomes

$R_n(\ell_p[\eta_n]) \sim k_n\tau_\eta^r = n\eta_n^p\tau_\eta^{r-p} = n\eta_n^p(2\log\eta_n^{-p})^{(r-p)/2}.$   (3.6)

Asymptotic minimaxity is had by thresholding at $\tau_\eta = (2\log\eta_n^{-p})^{1/2} \sim (2\log(n/k_n))^{1/2}$.
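The leading-order quantities of this subsection are easy to tabulate. The sketch below assumes the readings $\tau_\eta = (2\log\eta_n^{-p})^{1/2}$ and $k_n = n\eta_n^p\tau_\eta^{-p}$, and evaluates the leading term $k_n\tau_\eta^r$ of the minimax risk; `lp_ball_summary` is a hypothetical helper name and the numerical inputs are arbitrary.

```python
import math
from typing import Dict


def lp_ball_summary(n: int, eta: float, p: float, r: float) -> Dict[str, float]:
    """Leading-order threshold, effective non-zero count and minimax risk."""
    tau = math.sqrt(2.0 * math.log(eta ** (-p)))   # critical threshold
    k_n = n * eta ** p * tau ** (-p)               # effective number of non-zeros
    risk = k_n * tau ** r                          # leading term of the minimax risk
    return {"tau": tau, "k_n": k_n, "risk": risk}


summary = lp_ball_summary(n=10**6, eta=0.01, p=1.0, r=2.0)
print(summary)
```

These are asymptotic leading terms only; at moderate $n$ the lower-order corrections discussed in the Remark of Section 3.1 can be appreciable.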



3.3 Weak $\ell_p$ balls

The weak $\ell_p$ ball $m_p[\eta_n]$ contains the corresponding strong $\ell_p$ ball $\ell_p[\eta_n]$ with the same
radius, and the asymptotic minimax risk is larger by a constant factor:

$R_n(m_p[\eta_n]) \sim (r/(r-p))\,R_n(\ell_p[\eta_n]), \qquad n \to \infty.$   (3.7)

Let $F_p(x) = 1 - x^{-p}$, $x \ge 1$, denote the distribution function of the Pareto($p$) distribution
and let $X$ be a random variable having this law. Then an asymptotically least favorable
distribution for $m_p[\eta_n]$ is given by drawing $n$ i.i.d. samples from the univariate law

$\pi_1 = \mathcal{L}(\min(\eta_n X, \mu_n)),$   (3.8)

where $\mu_n$ is defined exactly as in the strong case. The mass of the prior probability atom
at $\mu_n$ equals $\int_{\mu_n}^{\infty} F_p(dx/\eta_n) = \eta_n^p\mu_n^{-p} = \beta_n$, again as in the strong case. Thus, the weak
prior can be thought of as being obtained from the strong prior by smearing the atom at
0 out over the interval $[\eta_n, \mu_n]$ according to a Pareto density with scale $\eta_n$. One can see the
origin of the extra factor in the minimax risk from the following outline (for details when
$r = 2$, see Johnstone (1994)). The minimax theorem says that $R_n(m_p[\eta_n])$ equals the Bayes
risk of the least favorable prior. This prior is roughly the product of $n$ copies of $\pi_1$, and the
corresponding Bayes estimator is approximately (for large $n$) soft thresholding at $\tau_\eta$, so

$R_n(m_p[\eta_n]) \approx n\int\rho_S(\tau_\eta, \mu)\,\pi_1(d\mu).$

Now consider an approximation to the risk function of soft thresholding, again at threshold
$\tau_\eta$. Indeed, using the estimate $\rho_S(t, \mu) \approx \rho_S(t, 0) + |\mu|^r$, appropriate in the range $0 \le \mu \le \mu_n$,
ignoring the term $\rho_S(t, 0)$ and reasoning as before (3.6), we find

$R_n(m_p[\eta_n]) \approx n\int_{\eta_n}^{\mu_n}\rho_S(\tau_\eta, \mu)\,F_p(d\mu/\eta_n) + n\beta_n\rho_S(\tau_\eta, \mu_n)$   (3.9)

$\qquad \approx n\int_{\eta_n}^{\mu_n}\mu^r\,p\,\eta_n^p\mu^{-p-1}\,d\mu + k_n\mu_n^r$   (3.10)

$\qquad \approx \left[\frac{p}{r-p} + 1\right]n\eta_n^p\mu_n^{r-p} \sim \frac{r}{r-p}R_n(\ell_p[\eta_n]).$   (3.11)

Comparison with (3.6) shows that the second term in (3.9)-(3.11) corresponds exactly to
$R_n(\ell_p[\eta_n])$; the first term is contributed by the Pareto density in the weak-$\ell_p$ case.
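The factor $r/(r-p)$ in (3.7) can be checked numerically from this outline. The sketch below works per coordinate (dropping the common factor $n$) and assumes the Pareto part of the risk is the integral of $\mu^r\,p\,\eta^p\mu^{-p-1}$ over $[\eta, \mu_n]$ while the atom contributes $\eta^p\mu_n^{r-p}$. All function names and the sample values are ours.

```python
def pareto_term(eta: float, mu: float, p: float, r: float) -> float:
    """Closed form of the integral of x^r * p * eta^p * x^(-p-1) over [eta, mu]."""
    return p * eta ** p * (mu ** (r - p) - eta ** (r - p)) / (r - p)


def pareto_term_numeric(eta: float, mu: float, p: float, r: float,
                        steps: int = 100000) -> float:
    """Midpoint-rule approximation of the same integral, as a cross-check."""
    h = (mu - eta) / steps
    total = 0.0
    for j in range(steps):
        x = eta + (j + 0.5) * h
        total += x ** r * p * eta ** p * x ** (-p - 1.0) * h
    return total


def atom_term(eta: float, mu: float, p: float, r: float) -> float:
    """Contribution beta * mu^r of the atom at mu, with beta = eta^p * mu^(-p)."""
    return eta ** p * mu ** (r - p)


def weak_to_strong_ratio(eta: float, mu: float, p: float, r: float) -> float:
    """Ratio of the weak-ball risk outline to the strong-ball (atom only) term."""
    atom = atom_term(eta, mu, p, r)
    return (pareto_term(eta, mu, p, r) + atom) / atom


print(weak_to_strong_ratio(eta=1e-3, mu=3.0, p=0.5, r=2.0), 2.0 / (2.0 - 0.5))
```

The ratio differs from $r/(r-p)$ only through the term $(\eta/\mu_n)^{r-p}$, which vanishes in the sparse limit.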

4 Adaptive Minimaxity of FDR Thresholding 

We now survey the path to our main result, providing in this section an overview of the 
remainder of the paper and the arguments to come. 

What we ultimately prove is broader than the result given in the introduction, and the 
argument will develop several ideas seemingly of broader interest. 

4.1 General Assumptions 

Continually below we invoke a collection of assumptions (Q),(H),(E), and (A) defined here. 

• False discovery control. We allow false-discovery rates to depend on $n$, but approach
a limit as $n \to \infty$. Moreover, if the limit is zero, rates should not go to zero very fast.
Formally define the assumption:

(Q) Suppose that $q_n \to q \in [0, 1)$. If $q = 0$, assume that $q_n \ge b_1/\log n$.
The constant $b_1 > 0$ is arbitrary; its value could be important at a specific $n$.

• Sparsity of the estimand. We consider only parameter sets which are sparse, and we
place quantitative upper bounds keeping them away from the 'dense' case. Formally
define:

(H) $\eta_n$ (for $\ell_0[\eta_n]$) and $\eta_n^p$ (for $m_p[\eta_n]$) lie in the interval $[n^{-1}\log^5 n, b_2n^{-b_3}]$.

Here the constants $b_2 > 0$ and $b_3 > 0$ are arbitrary; their chosen values could again
be important in finite samples.

• Diversity of Estimators. Our results apply not just to the usual FDR-based estimator
$\hat\mu_F$ of (1.7) but also the penalty-based estimator $\hat\mu_2$ of (1.11). More generally, recall
the thresholds $[\hat t_F, \hat t_G]$ defined in Section 1.8. Under formal assumption (E), we consider
any estimator $\hat\mu$ obeying

$\hat\mu$ = hard thresholding at $\hat t \in [\hat t_F, \hat t_G]$ w.p. 1.   (4.1)

• Notation. We introduce a sequence $\alpha_n$ which often appears in estimates in Sections
7 and 8 and in dependent material. We also define constants $q'$ and $q''$. Formally:
(A) Set $\alpha_n = 1/(b_4\tau_\eta)$, with $b_4 = (1 - q)/4$. Also set $q' = (q + 1)/2$ and $q'' =
(1 - q)/2 = 1 - q'$.

Finally, as a global matter, we suppose that our observations $y \sim N_n(\mu, I)$; thus $\sigma^2 = 1$.
For estimation of $\mu$, we consider risk (1.9), $0 < r \le 2$, and minimax risk $R_n(\Theta_n)$ of (1.10).
Here the parameter spaces are $\Theta_n = \ell_0[\eta_n]$ or $\ell_p[\eta_n]$ or $m_p[\eta_n]$, defined in the
Introduction (compare (1.4)), with $0 < p < r$.

4.2 Upper Bound Result 

Our argument for Theorem 1.1 splits into two parts, beginning with an upper bound on
minimax risk:

Theorem 4.1. Assume (H),(E),(Q). Then, as $n \to \infty$,

$\sup_{\mu\in\Theta_n}\rho(\hat\mu, \mu) \le R_n(\Theta_n)\left\{1 + u_{rp}\frac{(2q_n - 1)_+}{1 - q_n} + o(1)\right\},$   (4.2)

where $u_{rp} = 1 - (p/r)$ if $\Theta_n = m_p[\eta_n]$ and $u_{rp} = 1$ if $\Theta_n = \ell_p[\eta_n]$ or $\ell_0[\eta_n]$.

The bare bones of our strategy for proving the upper bound result were described in 
the Introduction. The global idea is to study the penalized FDR estimator $\hat\mu_2$ of (1.8) and
then compare to the behavior of $\hat\mu_F$. To make this work, numerous technical facts will
be needed concerning the behavior of hard thresholding, the mean and fluctuations of the
threshold exceedance process, and so on. As it turns out, those same technical facts form
the core of our lower bound on the risk behavior of $\hat\mu_F$. As a result, it is convenient for
us to study the lower bound and associated technical machinery first, in Sections 5-8 (with 
some details deferred to Sections 12 and 13), and then later, in Sections 9-11, to prove the 
upper bound, using results and viewpoints established earlier. 

4.3 Lower Bound Result 

Theorem 1.1 is completed by a lower bound on the behavior of the FDR estimator.

Theorem 4.2. Suppose (H),(Q). With notation as in Theorem 4.1,

$\sup_{\mu\in\Theta_n}\rho(\hat\mu_F, \mu) \ge R_n(\Theta_n)\left\{1 + u_{rp}\frac{(2q_n - 1)_+}{1 - q_n} + o(1)\right\}, \qquad n \to \infty,$   (4.3)

where $u_{rp} = 1 - (p/r)$ for $\Theta_n = m_p[\eta_n]$ and $u_{rp} = 1$ for $\Theta_n = \ell_p[\eta_n]$.

This bound establishes the importance of $q$, showing that if $q > 1/2$, then certainly
FDR cannot be asymptotically minimax. We turn immediately to its proof.
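To get a feeling for how quickly the penalty for $q > 1/2$ grows, one can tabulate the limiting factor $1 + u_{rp}(2q-1)_+/(1-q)$. The sketch below assumes that this is the form of the bracketed factor in the two theorems; `risk_inflation` is a hypothetical helper name.

```python
from typing import Optional


def risk_inflation(q: float, p: Optional[float] = None, r: float = 2.0,
                   weak: bool = False) -> float:
    """Limiting ratio of worst-case FDR risk to minimax risk at FDR level q."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    if weak:
        if p is None:
            raise ValueError("p is required for weak balls")
        u = 1.0 - p / r   # weak l_p balls
    else:
        u = 1.0           # l_0 and strong l_p balls
    return 1.0 + u * max(2.0 * q - 1.0, 0.0) / (1.0 - q)


for q in (0.25, 0.5, 0.75, 0.99):
    print(q, risk_inflation(q), risk_inflation(q, p=1.5, weak=True))
```

The factor equals 1 for every $q \le 1/2$ and diverges as $q \to 1$. These are asymptotic constants, so they need not match finite-sample ratios such as those in Table 1.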

5 Proof of the Lower Bound 

The proof involves three technical but significant ideas. First, it bounds the number of
discoveries made by FDR, as a function of the underlying means $\mu$. Second, it studies
the risk of ordinary hard thresholding with non-adaptive threshold in a specially chosen,
(quasi-) least-favorable one-parameter subfamily of $\Theta_n$. Finally, it combines these elements
to show that, on this least-favorable subfamily, $\hat\mu_F$ behaves like hard thresholding with a
particular threshold. The lower bound result then follows. Unavoidably, the results in this
section will invoke lemmas and corollaries only proven in later sections.

Beyond simply proving the lower bound, this section introduces some basic viewpoints
and notions. These include

• A threshold exceedance function M, which counts the number of threshold exceedances
as a function of the underlying means vector.

• A special 'coordinate system' for thresholds, mapping thresholds t onto the scale of 
relative expected exceedances. 

• A special one-parameter (quasi-) least-favorable family for FDR, at which the lower 
bound is established. 

These notions will be used heavily in later sections. 

5.1 Mean Exceedances and Mean Discoveries 

Define the exceedance number $N(t_k) = \#\{i : |y_i| \ge |t_k|\}$. Since $|y|_{(k)} \ge |t_k|$ if and only if
$N(t_k) \ge k$, we are interested in the values of $k$ for which $N(t_k) \ge k$. (See Section 7
for details.)

Throughout the paper we will refer to the mean threshold exceedance function, counting
the mean number of exceedances over threshold as $k$ varies:

$M(k; \mu) = E_\mu N(t_k) = \sum_{l=1}^{n} P_\mu(|y_l| \ge t_k) = \sum_{l=1}^{n}\Phi([-\mu_l - t_k, -\mu_l + t_k]^c).$   (5.4)

[Here $\Phi(A)$ denotes the probability of event $A$ under the standard Gaussian probability
distribution, and $k$ is extended from positive integer to positive real values.] If $\mu = 0$, then
$M(k; \mu) = 2n\tilde\Phi(t_k) = qk$, reflecting the fact that in the null case, the fraction of exceedances
is always governed by the FDR parameter $q$. If $\mu \ne 0$, we expect that $\hat k_F$ will be close to
the mean discovery number

$k(\mu) = \inf\{k \in \mathbb{R}_+ : M(k; \mu) = k\}.$   (5.5)

The existence and uniqueness of $k(\mu)$ when $\mu \ne 0$ follows from facts to be established in
Section 6.3: that (taking $k$ as real-valued), the function $k \to M(k; \mu)/k$ decreases strictly
and continuously from a limit of $+\infty$ as $k \to 0$ to a limit $< 1$ as $k \to n$.

A key point is that the mean discovery number is bounded over the parameter spaces
$\Theta_n$. The mean discovery number is monotone in $\mu$: if $|\mu_1|_{(k)} \ge |\mu_2|_{(k)}$ for all $k$, then
$k(\mu_1) \ge k(\mu_2)$. Thus, on $\ell_0[\eta_n]$, the largest mean discovery number, $\bar M_n$ say, is obtained by
taking $k_n = [n\eta_n]$ components to be very large. Writing this out:

$\bar M_n(k) = \sup_{\ell_0[\eta_n]} M(k; \mu) = k_n + 2(n - k_n)\tilde\Phi(t_k)$

$\qquad = k_n + (1 - k_n/n)kq_n \sim k_n + kq_n,$

using the definition of $t_k = z(kq_n/2n)$ and $\eta_n \sim k_n/n \to 0$. The first term corresponds to
"true" discoveries, and the second to "false" ones. Solving $\bar M_n(k) = k$ yields a solution

$k = k_n/(1 - (1 - n^{-1}k_n)q_n) \sim k_n/(1 - q_n).$   (5.6)

In particular, for all $\mu \in \ell_0[\eta_n]$, we have $k(\mu) \le k_n/(1 - q_n)\,(1 + o(1))$.
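The fixed-point calculation can be reproduced numerically. The sketch below evaluates $M(k;\mu)$ as the sum of $P(|y_l| \ge t_k)$ with $t_k = z(kq/2n)$ and solves $M(k;\mu) = k$ by bisection, relying on the stated monotonicity of $M(k;\mu)/k$. The function names, the stand-in value 50 for a "very large" mean, and the problem sizes are all our own choices.

```python
from collections import Counter
from statistics import NormalDist
from typing import Sequence

_STD_NORMAL = NormalDist()


def fdr_threshold(k: float, n: int, q: float) -> float:
    """t_k = z(k q / 2n), the upper (k q / 2n)-quantile of the standard normal."""
    return -_STD_NORMAL.inv_cdf(k * q / (2 * n))


def mean_exceedances(k: float, means: Sequence[float], q: float) -> float:
    """M(k; mu): expected number of |y_l| at or above t_k when y_l ~ N(mu_l, 1)."""
    n = len(means)
    t = fdr_threshold(k, n, q)
    total = 0.0
    # Group equal means so that sparse vectors are cheap to evaluate.
    for m, count in Counter(means).items():
        prob = _STD_NORMAL.cdf(m - t) + _STD_NORMAL.cdf(-m - t)
        total += count * prob
    return total


def mean_discovery_number(means: Sequence[float], q: float,
                          iterations: int = 200) -> float:
    """Solve M(k; mu) = k by bisection, using that M(k; mu) / k is decreasing."""
    lo, hi = 1e-9, float(len(means))
    if mean_exceedances(lo, means, q) <= lo:
        return 0.0  # no crossing, e.g. mu = 0 with q < 1
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_exceedances(mid, means, q) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


n, k_n, q = 2000, 20, 0.25
means = [50.0] * k_n + [0.0] * (n - k_n)   # k_n "very large" components
print(mean_discovery_number(means, q), k_n / (1 - (1 - k_n / n) * q))
```

For this configuration the bisection result agrees with the closed form in (5.6) to numerical precision.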

Weak $\ell_p$. On $\Theta_n = m_p[\eta_n]$, $E_\mu N(t_k)$ is maximized by taking the components of $\mu$ as large as
possible, i.e. at the coordinatewise upper bound $\bar\mu_l = \eta_n(n/l)^{1/p}$. Thus now

$\bar M_n(k) = \sup_{m_p[\eta_n]} E_\mu N(t_k) = E_{\bar\mu}N(t_k).$

To approximate $M(k; \bar\mu)$, note first that the summands in (5.4) are decreasing from nearly 1 for $\bar\mu_l$
large to $2\tilde\Phi(t_k)$ when $\bar\mu_l$ is near 0. With $k$ held fixed, break the sum into two parts using $k_n = n\eta_n^p\tau_\eta^{-p}$.
[This choice is explained in more detail after (9.13) below.] For $l \le k_n$, the summands are mostly
well approximated by 1, and for $l > k_n$ predominantly by $2\tilde\Phi(t_k) = q_nk/n$. Since $k_n/n \approx 0$, we have

$M(\nu; \bar\mu) \approx k_n + (n - k_n)q_n\nu/n.$

Again the first term tracks "true" discoveries and the second "false" ones. Solving $M(\nu; \bar\mu) = \nu$
based on this approximation suggests that, just as in (5.6),

$k(\mu) \le k_n/(1 - q_n)\,(1 + o(1)).$   (5.7)

The full proof is given in Section 6.4.4.

Strong $\ell_p$. Since $\ell_p[\eta_n] \subset m_p[\eta_n]$, (5.7) applies here as well.



5.2 Typical behavior of $\hat k_F$ and $\hat k_G$

We turn to the stochastic quantities $\hat k_F$ and $\hat k_G$. These are defined in terms of the exceedance
numbers $N(t_k)$, which themselves depend on independent (and non-identically distributed)
Bernoulli variables. This suggests the use of bounds on $\hat k_F$ and $\hat k_G$ derived from large
deviations properties of $N(t_k)$. Since we are concerned mainly with relatively high thresholds
$t_k$, results appropriate to Poisson regimes are required. Details are given in a later section.

To describe the resulting bounds on $[\hat k_G, \hat k_F]$, we first introduce some terminology. We
say that an event $A_n(\mu)$ is $\Theta_n$-likely if there exist constants $c_0, c_1$ not depending on $n$ and
$\Theta_n$ such that

$\sup_{\mu\in\Theta_n} P_\mu\{A_n^c(\mu)\} \le c_0\exp\{-c_1\log^2 n\}.$

With $\alpha_n$ as in assumption (A), define

$k_-(\mu) = \begin{cases} k(\mu) - \alpha_nk_n & k(\mu) \ge 2\alpha_nk_n, \\ 0 & k(\mu) < 2\alpha_nk_n, \end{cases}$   (5.8)

and

$k_+(\mu) = k(\mu) \vee \alpha_nk_n + \alpha_nk_n.$   (5.9)

Proposition 5.1. Assume (Q), (H) and (A). For each of the parameter spaces $\Theta_n$, it is
$\Theta_n$-likely that

$k_-(\mu) \le \hat k_G \le \hat k_F \le k_+(\mu).$

Thus all the penalized estimates $\hat k_r$ (and any $\hat k \in [\hat k_G, \hat k_F]$) are with uniformly high
probability bracketed between $k_-(\mu)$ and $k_+(\mu)$. In particular, note that

$k_+(\mu) - k_-(\mu) \le 3\alpha_nk_n,$   (5.10)

and so the fluctuations in $\hat k_r$ are typically small compared to the maximal value over $\Theta_n$.
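The width bound (5.10) follows from a two-case check, which the following sketch makes concrete. It assumes that (5.8) sets the lower bracket to $k(\mu)$ minus the slack $\alpha_nk_n$ when $k(\mu)$ is at least twice the slack and to zero otherwise, and that (5.9) sets the upper bracket to $\max(k(\mu), \alpha_nk_n) + \alpha_nk_n$. The function name `discovery_bracket` and the numbers are ours.

```python
from typing import Tuple


def discovery_bracket(k_mu: float, k_n: float, alpha_n: float) -> Tuple[float, float]:
    """Return the pair (k_minus, k_plus) around the mean discovery number k_mu."""
    slack = alpha_n * k_n
    k_minus = k_mu - slack if k_mu >= 2.0 * slack else 0.0
    k_plus = max(k_mu, slack) + slack
    return k_minus, k_plus


for k_mu in (0.0, 3.0, 8.0, 10.0, 100.0):
    lower, upper = discovery_bracket(k_mu, k_n=50.0, alpha_n=0.1)
    # The width never exceeds 3 * alpha_n * k_n = 15 in this example.
    print(k_mu, lower, upper, upper - lower)
```

When $k(\mu)$ is at least twice the slack the width is exactly $2\alpha_nk_n$; otherwise the lower bracket is zero and the upper bracket is below $3\alpha_nk_n$.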

Here and below it is convenient to have a notational variant for $t_k$, used especially when
the subscript would be very complicated; so define

$t[k] = z(kq/2n);$

keep in mind that $t$ depends implicitly on $q = q_n$ and $n$. We occasionally use $t_k$ when the
subscript is very simple.

Giving this notation its first workout, the thresholds $\hat t_r$ are bracketed between

$t_+(\mu) = t[k_-(\mu)]$ and $t_-(\mu) = t[k_+(\mu)].$   (5.11)

Note that $t_+ \ge t_-$, but from (5.10) and (12.14) it follows that $t_+/t_- \le 1 + 3\alpha_n/t_-^2$.

5.3 Risk of Hard Thresholding 

We now study the error of fixed thresholds as a prelude to the study of the data-dependent
FDR thresholds. We define one-parameter families of configurations and of thresholds which
exhibit key transitional behavior. As might be expected, these are concentrated around the
critical threshold $\tau_\eta = \sqrt{2\log\eta_n^{-1}}$ corresponding to sparsity level $\eta_n$.

Consider first a family of (quasi-) least-favorable means $\mu_\alpha$. The coordinates take one
of two values, most being zero, and the others amounting to a fraction $\eta_n$ with value roughly
$\tau_\eta + \alpha$. Specifically, for $\alpha \in \mathbb{R}$, set

$\mu_{\alpha,l} = \begin{cases} t[k_n] + \alpha & l \le k_n, \\ 0 & k_n < l \le n. \end{cases}$   (5.12)

In a sense $\mu_{0,k_n}$ is right at the FDR boundary, while with $\alpha > 0$, $\mu_{\alpha,k_n}$ is above the
boundary and with $\alpha < 0$ it is below the boundary.

Next, consider a 'coordinate system' for measuring the height of thresholds in the vicinity
of the FDR boundary. Think of thresholds $\{t : t > 0\}$ as generated by $\{t[ak_n], a > 0\}$,
with $a$ fixed while $n$ and $k_n$ increase. For $a = 1$, we are on the FDR boundary at $k_n$, so
that $a < 1$ is above the boundary and $a > 1$ is below the boundary. The 'coordinate' $a$ will
be heavily used in what follows.

In fact, these thresholds vary only slowly with $a$: for $a$ fixed, as $n \to \infty$,

$|t[ak_n] - t[k_n]| \le c(a)\tau_\eta^{-1}.$   (5.13)

Nevertheless, the effect of $a$ is visible in the leading term of the risk:

Proposition 5.2. Let $\alpha \in \mathbb{R}$ and $a > 0$ be fixed. Let the configuration $\mu_\alpha \in \ell_0[\eta_n]$ be defined
by (5.12). For $\ell_r$ loss, the risk of hard thresholding at $t[ak_n]$ is given, as $n \to \infty$, by

$\rho(\hat\mu_{H,t[ak_n]}, \mu_\alpha) = [\tilde\Phi(\alpha) + aq_n]\cdot k_n\tau_\eta^r\cdot(1 + o(1)).$   (5.14)

Here $k_n\tau_\eta^r$ is asymptotic to the minimax risk for $\ell_0[\eta_n]$ (compare (3.1)) and so defines
the benchmark for comparison.

The two leading terms in (5.14) reflect false negatives and false positives respectively.
The proof is given in Section 13. Here we aim only to explain how these terms arise.

The false-negative term $\tilde\Phi(\alpha)k_n\tau_\eta^r$ decreases as $\alpha$ increases. This is natural, as the
signals with mean $m_\alpha = t[k_n] + \alpha$ become easier to detect as $\alpha$ increases, whatever be the
threshold $t[ak_n]$. More precisely, the $\ell_r$-error due to non-detection, $|y_i| < t[ak_n]$, on each of
the $k_n$ terms with mean $m_\alpha$ contributes risk

$k_nm_\alpha^rP_{m_\alpha}(|y_i| < t[ak_n]) \sim k_n\tau_\eta^r\,\Phi(t[ak_n] - m_\alpha),$   (5.15)

since $m_\alpha \sim \tau_\eta$ as $n \to \infty$. Finally, (5.13) shows that

$m_\alpha - t[ak_n] = \alpha + t[k_n] - t[ak_n] = \alpha + O(\tau_\eta^{-1}),$   (5.16)

so that (5.15) is approximately $\tilde\Phi(\alpha)k_n\tau_\eta^r$.

The false-positive term shows a relatively subtle dependence on the threshold $a \to t[ak_n]$.
There are $n - k_n$ means that are exactly 0, and so the risk due to false discoveries is

$(n - k_n)E\{|z|^r : |z| > t[ak_n]\} \sim 2n\,t[ak_n]^r\,\tilde\Phi(t[ak_n])$   (5.17)

$\qquad = ak_nq_n\,t[ak_n]^r$   (5.18)

$\qquad \sim aq_nk_n\tau_\eta^r.$   (5.19)

[(5.17) follows from (8.6) below, while (5.18) uses the definition of the FDR boundary,
$t[k_n] = \tilde\Phi^{-1}(k_nq_n/2n)$, and finally (5.19) follows from (12.19) below.]

Weak $\ell_p$. In this case, we replace the two-point configuration by Winsorized analogs
in the spirit of Section 3.3:

$\mu_{\alpha,l} = \bar\mu_l \wedge m_\alpha, \qquad m_\alpha = t[k_n] + \alpha.$   (5.20)

Now an extra term appears in the risk of hard thresholding when using thresholds $t[ak_n]$:

Proposition 5.3. Adopt the setting of Proposition 5.2, replacing only (5.12) by (5.20).
Then

$\rho(\hat\mu_{H,t[ak_n]}, \mu_\alpha) = \left[\tilde\Phi(\alpha) + \frac{p}{r-p} + aq_n\right]\cdot k_n\tau_\eta^r\cdot(1 + o(1)).$

The same phenomena as for $\ell_0[\eta_n]$ apply here, except that the $p/(r-p)$ term arises
due to the cumulative effect of missed detections of means $\bar\mu_l$ that are smaller than $m_\alpha$ but
certainly not 0. This term decreases as $p$ becomes smaller, essentially due to the increasingly
fast decay of $\bar\mu_l = c_nl^{-1/p}$. The term disappears in the $p \to 0$ limit, and we recover the $\ell_0$-risk
(5.14). This result is also proved in Section 13.

5.4 FDR on the Least Favorable Family 

To track the response of the FDR estimator to members of the family $\{\mu_\alpha, \alpha \in \mathbb{R}\}$, we look
first at the mean discovery numbers. In a later section we prove:

Proposition 5.4. Assume (Q), (H) and (A). Fix $\alpha \in \mathbb{R}$ and define $\mu_\alpha$ by (5.12) and (5.20)
for $\ell_0[\eta_n]$ and $m_p[\eta_n]$ respectively. Then as $n \to \infty$,

$k(\mu_\alpha) \sim (1 - q_n)^{-1}\Phi(\alpha)k_n.$   (5.21)

Heuristically, for $\ell_0[\eta_n]$, we approximate

$M(k; \mu_\alpha) = k_n[\tilde\Phi(t_k - m_\alpha) + \Phi(-t_k - m_\alpha)] + 2(n - k_n)\tilde\Phi(t_k)$   (5.22)

$\qquad \sim k_n\Phi(m_\alpha - t_k) + q_nk \sim k_n\Phi(\alpha) + q_nk,$

from (5.16). Solving $M(k; \mu_\alpha) = k$ based on this approximation leads to (5.21). The same
approach works for $m_p[\eta_n]$, but with more attention needed to bounding the component
terms in $M(k; \mu_\alpha)$, as detailed in a later section.

Proposition 5.4 suggests that at configuration $\mu_\alpha$, FDR will choose a threshold close to
$t[k(\mu_\alpha)]$, which is of the form $t[ak_n]$ with $a \sim (1 - q)^{-1}\Phi(\alpha)$. Thus, as $\alpha$ increases, and
with it the non-zero components of $\mu_\alpha$, the FDR threshold decreases, albeit modestly.

The risk incurred by FDR at fi = fia corresponds to that of hard thresholding at 
t[k{iJ,a)]- In Section 13 below we prove: 
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Proposition 5.5. Assume (Q), (H) and (A). Fix a G M and consider /Iq, defined by ()5.12|) 
for ^o[^n]- Then as n ^ oo, 

p{fiF,tia)= + $ (a) -^1 -fcnr;- (l + o(l)). (5.23) 

On the other hand, define fia using 1)5.20(1 for mp[rin]; then 

p{fiF,fia)= [!'(«) + ^ + $(a)-^l -A^nrl -(1 + 0(1)). (5.24) 
L r-p 1-gn-l 

Formula (5.23) shows visibly the role of the FDR control parameter $q$. Note that

$$\sup_\alpha\Big[\tilde\Phi(\alpha) + \Phi(\alpha)\frac{q}{1-q}\Big] = \max\Big\{1, \frac{q}{1-q}\Big\} = 1 + \frac{(2q-1)_+}{1-q}.$$ (5.25)

Consider the implications of this in (5.23) in the $\ell_0[\eta_n]$ case. The minimax risk is $\sim k_n\tau_\eta^r$,
and so the minimax risk is exceeded asymptotically whenever $q > 1/2$.

We interpret this further. When $q < 1/2$, the worst configurations in $\{\mu_\alpha\}$ correspond
to $\alpha$ large and negative, and yield essentially the minimax risk. Indeed, only a fraction $\Phi(\alpha)$ of the true
non-zero means are discovered. Each missed mean contributes risk $\sim \mu_\alpha^r \sim \tau_\eta^r$, and so the
risk due to missed means is given roughly by $\tilde\Phi(\alpha)k_n\tau_\eta^r$. The risk contribution due to false
discoveries, being controlled by $\Phi(\alpha)$, is negligible in these configurations.

When $q > 1/2$, the worst configurations in $\{\mu_\alpha\}$ correspond to $\alpha$ large and positive.
Essentially all of the $k_n$ non-zero components are correctly discovered, along with a fraction
$q$ of the $k(\mu_\alpha) \sim (1 - q)^{-1}\Phi(\alpha)k_n$ which are false discoveries. In the $\ell_0$ case, the false
discoveries dominate the risk, yielding an error of order $\Phi(\alpha)[q/(1 - q)]k_n\tau_\eta^r$.

When $q = 1/2$, the risk $\rho(\hat\mu_F,\mu_\alpha) \sim k_n\tau_\eta^r$ regardless of $\alpha$, so that all configurations $\mu_\alpha$
are equally bad, even though the fraction $\Phi(\alpha)$ of risk due to false discoveries changes from
0 to 1 as $m_\alpha = t[k_n] + \alpha$ increases from values below $t[k_n]$ through values above.
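The role of $q$ in the bracket $\tilde\Phi(\alpha) + \Phi(\alpha)q/(1-q)$ can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names and the grid over $\alpha$ are illustrative assumptions, and the supremum is only approximated on a finite grid.

```python
from math import erf, sqrt


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def risk_factor(alpha, q):
    """Missed-mean term plus false-discovery term: (1 - Phi(a)) + Phi(a) q/(1-q)."""
    return (1.0 - Phi(alpha)) + Phi(alpha) * q / (1.0 - q)


def sup_risk_factor(q, lo=-8.0, hi=8.0, steps=1601):
    """Grid approximation (an assumption of this sketch) to the sup over alpha."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(risk_factor(a, q) for a in grid)


if __name__ == "__main__":
    for q in (0.1, 0.25, 0.5, 0.75, 0.9):
        # Second and third columns should agree: sup equals max(1, q/(1-q)).
        print(q, sup_risk_factor(q), max(1.0, q / (1.0 - q)))
```

For $q \le 1/2$ the grid maximum sits at the most negative $\alpha$ and is essentially 1; for $q > 1/2$ it sits at the most positive $\alpha$ and is essentially $q/(1-q)$.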

5.5 Proof of the Lower Bound 

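Our interpretation of Proposition 5.5 in effect gave away the idea for the proof of (4.3).
We now fill in a few details.

In the $\ell_0[\eta_n]$ case, fix $\epsilon > 0$. Choosing $\alpha = \alpha(\epsilon; q)$ sufficiently large positive or negative
according as $q > 1/2$ or $q < 1/2$, we get

$$\tilde\Phi(\alpha(\epsilon)) + \frac{q}{1-q}\Phi(\alpha(\epsilon)) \ge (1 - \epsilon/2)\cdot\sup_\alpha\Big[\tilde\Phi(\alpha) + \frac{q}{1-q}\Phi(\alpha)\Big].$$

(5.23) gives that for large $n$,

$$\rho(\hat\mu_F,\mu_{\alpha(\epsilon)}) \ge \Big[1 + \frac{(2q-1)_+}{1-q}\Big]\cdot k_n\tau_\eta^r\cdot(1 - \epsilon).$$

But the family $\mu_\alpha \in \ell_0[\eta_n]$, and $R_n(\ell_0[\eta_n]) \sim k_n\tau_\eta^r$; hence

$$\rho(\hat\mu_F,\ell_0[\eta_n]) \ge \rho(\hat\mu_F,\mu_{\alpha(\epsilon)}) \ge \Big[1 + \frac{(2q-1)_+}{1-q}\Big]\cdot R_n(\ell_0[\eta_n])\cdot(1 - \epsilon).$$

As this is true for all $\epsilon > 0$, the $\ell_0[\eta_n]$ case of (4.3) follows.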
For the $\ell_p[\eta_n]$ case, fix $\epsilon > 0$, and choose again $\alpha = \alpha(\epsilon; q)$ as in the $\ell_0$ case. Note that
$m_\alpha$ is implicitly a function $m_\alpha[k_n]$ of the number of nonzeros. Define the pair and $k'_n$
informally as the joint solution of

(A formal definition can be made using the approach in Section 3.2.) Now the mean vector
$\mu'_\alpha$ with $k'_n$ nonzeros each taking value $m_\alpha[k'_n]$ gives, again by (5.23), that for large $n$,

$$\rho(\hat\mu_F,\mu'_\alpha) \ge \Big[1 + \frac{(2q-1)_+}{1-q}\Big]\cdot k'_n\,m_\alpha[k'_n]^r\cdot(1 - \epsilon).$$

Now from Section 3.2, $R_n(\ell_p[\eta_n]) \sim n\eta_n^p\tau_\eta^{r-p}$, while



$n \to \infty$.



At the same time, $\mu'_\alpha \in \ell_p[\eta_n]$. Hence

$$\rho(\hat\mu_F,\ell_p[\eta_n]) \ge \rho(\hat\mu_F,\mu'_{\alpha(\epsilon)}) \ge \Big[1 + \frac{(2q-1)_+}{1-q}\Big]\cdot R_n(\ell_p[\eta_n])\cdot(1 - \epsilon).$$

Again this holds for all $\epsilon > 0$, and the $\ell_p[\eta_n]$ case of (4.3) follows.

The argument for (4.3) in the $m_p[\eta_n]$ case is entirely parallel, only using (5.24) and the
modified definition of $\mu_\alpha$ for the $m_p$ case.



5.6 Infrastructure for the Lower Bound 

We look ahead now to the arguments supporting the Propositions of this section. 

Propositions 5.2-5.4 will be proved in Section 13 at the very end of the paper. The proofs
exploit viewpoints and estimates set forth in Sections 6-8. Section 6 sets out properties of
the mean detection function $\nu \to M(\nu;\mu)$ of (5.4) and its derivatives, with a view to
deriving information and bounds on the discovery number $k(\mu)$ of Section 5.1.

Section 7 applies these bounds in combination with the large deviation bounds to prove
Proposition 5.1 and show that $\hat k_F - \hat k_G \le 3\alpha_nk_n$. Section 8 collects various bounds on the
risk of fixed thresholds, and the risk difference between two data dependent thresholds.

All these sections frequently invoke a very useful appendix, Section 12, which assembles
needed properties of the Gaussian distribution, of the quantile function $z(\eta) = \tilde\Phi^{-1}(\eta)$ and
of implications of the FDR boundary $t_k$.



6 The Mean Detection Function 

6.1 Comparing weak £p with £o: the effective non-zero fraction 

A key feature of $\ell_0[\eta_n]$ is that only $k_n = n\eta_n$ coordinates may be non-zero. Consequently,
the number of 'discoveries' at threshold $t[\nu]$ from $n - k_n$ zero co-ordinates is at most linear
in $\nu$, with slope bounded by $q_n$:

$$(n - k_n)\,p_\nu(0) = n(1 - \eta_n)\,\nu q_n/n \le q_n\nu,$$ (6.1)

since $p_\nu(0) = 2\tilde\Phi(t[\nu]) = q_n\nu/n$.

In the case of weak $\ell_p$, the discussion around (9.12)-(9.13) showed that for certain
purposes, the index $k_n = n\eta_n^p\tau_\eta^{-p}$ counts the maximum number of 'significantly' non-zero
co-ordinates.

In this section, we will see that an alternate, slightly larger index, $k'_n = n\eta_n^p\tau_\eta^p$, yields for
weak $\ell_p$ the property analogous to (6.1): the number of discoveries at $t_\nu$ from the $n - k'_n$
smallest means $\mu_l$ is not essentially larger than $q_n\nu$. At least for the extremal configuration
$(\bar\mu_l)$, the range of indices $[k_n, k'_n]$ constitutes a 'transition' zone between 'clearly non-zero'
means and 'effectively zero' ones: this is discussed further in Section 6.4.

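To state the result, we need a certain error-control function; let

$$\delta_p(\epsilon) = p\,\epsilon\int_{\epsilon^{1/p}}^{1} w^{-p}\,dw = \begin{cases} p\,\epsilon\,(1 - \epsilon^{(1-p)/p})/(1-p) & p \ne 1,\\ (\log\epsilon^{-1})\,\epsilon & p = 1. \end{cases}$$ (6.2)

Lemma 6.1. Assume hypotheses (Q) and (H). Let $\tau_\eta^2 = 2\log\eta_n^{-p}$ and $\epsilon_n = \eta_n^p\tau_\eta^p$,

and $\delta_p(\epsilon)$

be defined as above. There exists $c = c(b_1, b_3) > 0$ such that for $\nu$ with $t_\nu \ge 2$, we have,
uniformly in $\mu \in m_p[\eta_n]$, that

$$[1 - \epsilon_n]\,q_n\nu \le \sum_{l > k'_n} p_\nu(\mu_l) \le [1 + c\,\delta_p(\epsilon_n)]\,q_n\nu.$$ (6.3)

The proof is deferred to Section 6.4 - compare the proof of (6.29) there.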

Remark. Suppose $q_n \to q \in [0, 1)$. Then for $n$ sufficiently large (i.e. $n$ larger than some
$n_0$ depending on $p$, $q$, and the particular sequence $\eta_n$), it follows that

$$[1 + c\,\delta_p(\epsilon_n)]\,q_n \le \bar q_n := \begin{cases} (3/2)\,q_n & \text{if } q_n \le 1/2,\\ (1 + q_n)/2 & \text{if } 1/2 < q_n < 1; \end{cases}$$ (6.4)

in particular, the right side is strictly less than 1.

We have just defined $\bar q_n$ for the case of $m_p$. If in the nearly-black case ($\ell_0[\eta_n]$) we agree
to define $\bar q_n = q_n$ then we may write both conclusions (6.1) and (6.3) in one unified form:

$$\sum_{l > k'_n} p_\nu(\mu_l) \le \bar q_n\nu.$$ (6.5)

The "true positive" rate. Adopt for a moment the language of diagnostic testing and
call those means with $\mu_i \ne 0$ "positives", and those with $\mu_i = 0$ "negatives". In the nearly-
black case there are typically $k_n$ positives out of $n$. In the weak-$\ell_p$ case, there are formally
as many as $n$ positives, but as argued above, there are effectively $k'_n = n\eta_n^p\tau_\eta^p$ positives, and
it is this interpretation that we take here. If it is assumed (without loss of generality) that
$\mu_1 \ge \mu_2 \ge \dots \ge \mu_n \ge 0$, then the true positive rate using threshold $t[\nu]$ is defined by

$$\pi_\nu(\mu) = (1/k'_n)\sum_{l=1}^{k'_n} p_\nu(\mu_l).$$ (6.6)

In our sparse settings, if the mean discovery number for $\mu$ exceeds $\nu$, then there is a lower
bound on the true positive rate at $\mu$ using threshold $t[\nu]$.

Corollary 6.2. Assume (Q) and (H) and define $\bar q_n$ by (6.4). If $n$ is sufficiently large,
then uniformly over $\mu$ in $m_p[\eta_n]$ or $\ell_0[\eta_n]$ for which $k(\mu) \ge \nu$, the true positive rate using
threshold $t[\nu]$ satisfies

$$\pi_\nu(\mu) \ge (1 - \bar q_n)(\nu/k'_n).$$

Proof. From the definition of the mean exceedance number, we have

$$M(\nu;\mu) = k'_n\,\pi_\nu(\mu) + \sum_{l > k'_n} p_\nu(\mu_l).$$

Since $\nu \le k(\mu)$ we have $M(\nu;\mu) \ge \nu$, and to bound the sum we use (6.5). Hence $k'_n\pi_\nu(\mu) \ge
\nu - \bar q_n\nu$, as required. □
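To make the quantities in this subsection concrete, here is a small numerical sketch for the nearly-black case, where $\bar q_n = q_n$ and $k'_n = k_n$. It assumes the FDR threshold is defined by $2\tilde\Phi(t[\nu]) = q\nu/n$ and that the exceedance probability is $p_\nu(\mu) = \tilde\Phi(t_\nu - \mu) + \Phi(-t_\nu - \mu)$; the function names and the example configuration are illustrative choices, not taken from the paper.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal


def fdr_threshold(nu, n, q):
    """t[nu], assumed to solve 2 * (1 - Phi(t)) = q * nu / n."""
    return STD.inv_cdf(1.0 - q * nu / (2.0 * n))


def exceed_prob(mu, t):
    """P(|Z + mu| > t) for Z ~ N(0, 1)."""
    return (1.0 - STD.cdf(t - mu)) + STD.cdf(-t - mu)


def mean_detection(nu, mus, q):
    """M(nu; mu): expected number of |y_l| exceeding t[nu]."""
    t = fdr_threshold(nu, len(mus), q)
    return sum(exceed_prob(m, t) for m in mus)


def true_positive_rate(nu, mus, k_eff, q):
    """Average exceedance probability over the k_eff largest |means|."""
    t = fdr_threshold(nu, len(mus), q)
    top = sorted((abs(m) for m in mus), reverse=True)[:k_eff]
    return sum(exceed_prob(m, t) for m in top) / k_eff


if __name__ == "__main__":
    n, k, q, nu = 1000, 20, 0.1, 10
    mus = [4.0] * k + [0.0] * (n - k)   # hypothetical nearly-black vector
    print("M(nu; mu)      =", mean_detection(nu, mus, q))
    print("true pos. rate =", true_positive_rate(nu, mus, k, q))
    print("lower bound    =", (1.0 - q) * nu / k)
```

In this example $M(\nu;\mu) \ge \nu$, so the discovery number is at least $\nu$ and the computed true positive rate should sit above $(1 - q)\nu/k$.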



6.2 Convexity properties of exceedances 

The goal of this subsection, Corollary 6.5, shows that a lower bound on the mean discovery
number $k(\mu)$ forces a lower bound on the mean threshold function $\nu \to M(\nu;\mu)$, at least
on sparse parameter sets. The idea is to establish convexity of a certain power function
associated with testing individual components $\mu_l$ and then to use the convexity to construct
two-point configurations providing the needed lower bounds.

Let $N(t_k) = \#\{l : |y_l| \ge t_k\}$, and as before $M(k;\mu) = E_\mu N(t_k) = \sum_{l=1}^n p_k(\mu_l)$. Here,
the exceedance probability for threshold $t_k$ is given by

$$p_k(\mu) = P\{|Z + \mu| > t_k\} = \tilde\Phi(t_k - \mu) + \Phi(-t_k - \mu),$$

and we note that as $\mu$ increases from 0 to $\infty$, $p_k(\mu)$ increases from $p_k(0) = 2\tilde\Phi(t_k) = qk/n$ to
$p_k(\infty) = 1$. It has derivative

$$p'_k(\mu) = \phi(t_k - \mu) - \phi(t_k + \mu) > 0, \qquad \mu \in (0, \infty).$$

Since $\mu \to p_k(\mu)$ is strictly monotone, the inverse function $\mu_k(\pi) = \mu[\pi; k] = p_k^{-1}(\pi)$ exists
for $qk/n \le \pi < 1$. In the language of testing, consider the two-sided test of $H_0 : \mu = 0$ that
rejects at $t_k$. Then $\mu_k(\pi)$ is that alternative $\mu$ at which the test has power $\pi$. In addition

$$\frac{d}{d\pi}\mu_k(\pi) = \frac{1}{p'_k(p_k^{-1}(\pi))} = \frac{1}{p'_k(\mu)}.$$

The bi-threshold function. Given indices $\nu < k$, so that $t_\nu > t_k$, consider minimizing
$M(\nu;\mu)$ over $\mu$ subject to the constraint that $M(k;\mu)$ stay fixed. Introduce variables
$\pi_l = p_k(\mu_l)$: we wish to minimize

$$M(\nu;\mu) = \sum_l p_\nu(\mu_l) = \sum_l p_\nu(p_k^{-1}(\pi_l)) \quad\text{subject to}\quad \sum_l \pi_l = m.$$

Define a bi-threshold function

$$g(\pi) = g_{\nu,k}(\pi) = p_\nu(p_k^{-1}(\pi)), \qquad qk/n \le \pi \le 1.$$

Thus, $g_{\nu,k}(\pi)$ gives the power of the test based on the $t_\nu$-threshold at the alternative where
the $t_k$-threshold has power $\pi$. As $\nu < k$, $g_{\nu,k}(\pi) \le \pi$.

Lemma 6.3. If $\nu < k$ then $\pi \to g_{\nu,k}(\pi)$ is convex and increasing from $q\nu/n$ to 1 for
$\pi \in [qk/n, 1]$.

Proof. Setting $\mu = p_k^{-1}(\pi)$, we have

$$g'(\pi) = \frac{p'_\nu(\mu)}{p'_k(\mu)} = \frac{\phi(t_\nu - \mu) - \phi(t_\nu + \mu)}{\phi(t_k - \mu) - \phi(t_k + \mu)} = \frac{\phi(t_\nu)}{\phi(t_k)}\cdot\frac{\sinh t_\nu\mu}{\sinh t_k\mu}.$$

To complete the proof, we show that if $t > u$, then the function $G(\mu) = f(t,\mu)/f(u,\mu)$ is

increasing for $f(t,\mu) = \sinh t\mu$. First note that

$$G'(\mu) = G(\mu)\int_u^t D_sD_\mu(\log f)(s,\mu)\,ds,$$

and that, on setting $y = 2s\mu$,

$$D_\mu\log\sinh(s\mu) = s\,\frac{\cosh s\mu}{\sinh s\mu} = \frac{1}{2\mu}\Big[y + \frac{2y}{e^y - 1}\Big].$$

Finally, $D_y\{y + 2y/(e^y - 1)\}$ has numerator proportional to $(e^y - y)^2 - 1 - y^2 > 0$ for $y > 0$. □
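The convexity asserted in Lemma 6.3 can be probed numerically. The sketch below is only an illustration under stated assumptions: it takes the power function to be $p(\mu) = \tilde\Phi(t - \mu) + \Phi(-t - \mu)$, inverts it by bisection (a choice of this sketch, not the paper's), and uses two arbitrary thresholds $t_\nu = 3 > t_k = 2$.

```python
from statistics import NormalDist

STD = NormalDist()


def power(mu, t):
    """Power of the two-sided test rejecting when |y| > t, at alternative mu."""
    return (1.0 - STD.cdf(t - mu)) + STD.cdf(-t - mu)


def inverse_power(pi, t, hi=40.0):
    """Bisection for the alternative mu >= 0 at which the test has power pi."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if power(mid, t) < pi:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def bi_threshold(pi, t_nu, t_k):
    """g(pi): power at threshold t_nu where the t_k-test has power pi."""
    return power(inverse_power(pi, t_k), t_nu)


if __name__ == "__main__":
    t_nu, t_k = 3.0, 2.0                      # hypothetical thresholds, t_nu > t_k
    grid = [0.05 * i for i in range(1, 20)]   # pi = 0.05, ..., 0.95
    values = [bi_threshold(p, t_nu, t_k) for p in grid]
    second_diffs = [values[i - 1] - 2 * values[i] + values[i + 1]
                    for i in range(1, len(values) - 1)]
    print("min second difference:", min(second_diffs))  # nonnegative if convex
```

Nonnegative second differences on the grid are consistent with convexity; the check is of course no substitute for the proof above.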



For weak $\ell_p$, define $\bar q_n$ as in (6.4), while for the nearly-black case, set $\bar q_n = q_n$. Let $a_n$
be positive constants to be specified. Since (6.4) guarantees that $\bar q_n < 1$, define

$$\pi_n = (1 - \bar q_n)\,a_n.$$

Proposition 6.4. Assume (Q) and (H). As before, let $k'_n = n\eta_n$ (for $\ell_0$) or $n\eta_n^p\tau_\eta^p$ (for
weak $\ell_p$). Define $\pi_n = a_n(1 - \bar q_n)$ as above. Then, uniformly in $\mu$ for which $k(\mu) \ge a_nk'_n$,
we have

(a) for $\nu \le a_nk'_n$,

$$M(\nu;\mu) \ge k'_n\,\tilde\Phi(t_\nu - \mu[\pi_n; a_nk'_n]).$$ (6.7)

(b) In particular, for $\nu = 1$, and $a_n \ge b_5(\log n)^{-r}$,

$$M(1;\mu) \ge c(\log n)^{\gamma - r - 1}.$$ (6.8)

Remarks. 1. The lower bound of (6.8) is valid for all sparsities $\eta_n^p$ in the range
$[n^{-1}\log^\gamma n, n^{-b_3}]$; clearly it is far from sharp for $\eta_n$ away from the lower limit. If needed,
better bounds for specific cases would follow from (6.12) and (6.13) in the proof below.

2. We shall need only the lower bound for $\nu = 1$, but the methods used below would
equally lead to bounds for larger $\nu$, working from the intermediate estimate (6.7).

Corollary 6.5. Let $k_n = n\eta_n^p\tau_\eta^{-p}$, and $\alpha_n$ as in assumption (A); then uniformly in $\mu$ for
which $k(\mu) \ge \alpha_nk_n$,

$$M(1;\mu) \ge c(\log n)^{\gamma - p - 3/2}.$$

Proof. For convenience, we abbreviate $a_nk'_n$ by $\kappa$. The bi-threshold function $g = g_{\nu,\kappa}$ is
convex (Lemma 6.3), and so

$$M(\nu;\mu) = \sum_{l=1}^{n} g_{\nu,\kappa}(\pi_l) \ge \sum_{l=1}^{k'_n} g(\pi_l) \ge k'_n\,g(\pi_\kappa(\mu)),$$

where $\pi_\kappa(\mu) = (1/k'_n)\sum_{l=1}^{k'_n}\pi_l$ is the true positive rate defined at (6.6). Since $k(\mu) \ge \kappa$,
Corollary 6.2 bounds $\pi_\kappa(\mu) \ge (1 - \bar q_n)(\kappa/k'_n) = \pi_n$ and so from the monotonicity of $g$,

$$g(\pi_\kappa(\mu)) \ge g(\pi_n) = p_\nu(\mu_\kappa(\pi_n)) \ge \tilde\Phi(t_\nu - \mu_\kappa(\pi_n)).$$

This establishes part (a). For (b), we seek an upper bound for

$$t_\nu - \mu_\kappa(\pi_n) = t_\nu - t_\kappa + t_\kappa - \mu_\kappa,$$ (6.9)

where we abbreviate $\mu_\kappa(\pi_n)$ by $\mu_\kappa$ for convenience. First note that

$$\pi_n = p_\kappa(\mu_\kappa) = \tilde\Phi(t_\kappa - \mu_\kappa) + \Phi(-t_\kappa - \mu_\kappa),$$

which shows that $\tilde\Phi(t_\kappa - \mu_\kappa) \le \pi_n \le 2\tilde\Phi(t_\kappa - \mu_\kappa)$, from which we get

$$t_\kappa - \mu_\kappa \le z(\pi_n/2) \le \sqrt{2\log(2\pi_n^{-1})},$$
using (12.8). Since $\pi_n = (1 - \bar q_n)a_n \ge c_3/(\log n)^r$, we conclude that

$$t_\kappa - \mu_\kappa(\pi_n) \le \sqrt{2r\log\log n} + c_4.$$ (6.10)

From (12.14), we have

$$0 \le t_\nu - t_\kappa \le \sqrt{2\log(n/\nu)} - \sqrt{2\log(n/\kappa)} + c(b_1, b_3).$$ (6.11)

Since $\kappa = a_nk'_n$ with $a_n \le 1$, the right side only increases if we replace $\kappa$ by $k'_n$.

At this point, we specialize to the case $\nu = 1$ and set $v_n = \sqrt{2\log n} - \sqrt{2\log(n/k'_n)}$.
Combining (6.9), (6.10) and (6.11), we find that

$$t_\nu - \mu_\kappa(\pi_n) \le v_n + c(b_1, b_3) + \sqrt{2r\log\log n} + c_4.$$

For $n > n(b)$, the last three terms are bounded by $s_n = \sqrt{(2r + 1)\log\log n}$. So, from (6.7),

$$M(1;\mu) \ge k'_n\,\tilde\Phi(v_n + s_n).$$ (6.12)
We may rewrite $k'_n$ in terms of $v_n$, obtaining

$$\log k'_n = v_n\sqrt{2\log n} - v_n^2/2.$$

The bound $\tilde\Phi(w) \ge \phi(w)/(2w)$ holds for $w > \sqrt2$; applying this we conclude

$$k'_n\,\tilde\Phi(v_n + s_n) \ge \frac{\phi(s_n)}{2(v_n + s_n)}\exp\{v_n(\sqrt{2\log n} - v_n - s_n)\}.$$ (6.13)

Since $e^{-s_n^2/2} = (\log n)^{-(2r+1)/2}$ and $v_n \le \sqrt{2\log n}$, the first factor is bounded below by
$c_0(\log n)^{-r-1}$. To bound the main exponential term, set $g(v) = v(\sqrt{2\log n} - v - s_n)$. We
note that $n\eta_n^p \in [\log^\gamma n, n^{1-b_3}]$ and so $\tau_\eta^2 = 2\log\eta_n^{-p} \le 2\log n$ and so $\tau_\eta^p \in [1, 2\log n]$ and
so $k'_n \in [\log^\gamma n, (2\log n)n^{1-b_3}]$. For $\ell_0[\eta_n]$, $k'_n \in [\log^\gamma n, n^{1-b_3}]$; we shall see shortly that the
difference between the two cases doesn't matter here.

We now estimate the values of $v$ and $g(v)$ corresponding to these bounds on $k'_n$. At the
lower end, $k'_n = \log^\gamma n$, then (using $\sqrt a - \sqrt{a - \epsilon} \ge \epsilon/(2\sqrt a)$),

$$v_n \ge \frac{\gamma\log\log n}{\sqrt{2\log n}} =: v_{1n}$$

and one checks that $g(v_{1n}) \ge \gamma\log\log n - 1$ for $n > n(b)$. At the upper end, if $k'_n =
(2\log n)n^{1-b_3}$, then

$$v_n \le (1 - \sqrt{b_3})\sqrt{2\log n} + c =: v_{2n}$$

and one checks that $g(v_{2n}) = (\sqrt{b_3} - b_3)(2\log n)(1 + o(1)) \ge g(v_{1n})$ for $n$ large.
Since $g(v)$ is a concave quadratic polynomial with maximum between $v_{1n}$ and $v_{2n}$, it
follows that for $k'_n$ in the range indicated above,

$$e^{g(v_n)} \ge e^{g(v_{1n})} \ge e^{-1}(\log n)^\gamma.$$
Combined with the bound on the first factor in (6.13) we get

$$M(1;\mu) \ge c(\log n)^{\gamma - r - 1},$$

as required for part (b).

For part (c), that is Corollary 6.5, we define $a_n$ by the equation $a_nk'_n = \alpha_nk_n$, so that $a_n = \alpha_n\tau_\eta^{-2p} \ge
\alpha_n(2\log n)^{-p}$. Since $\alpha_n \ge b_4(\log n)^{-1/2}$, we apply part (b) with $r = p + 1/2$. □

6.3 Properties of the Mean Detection Function 

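This subsection collects some properties of

$$M(\nu;\mu) = \sum_l [\tilde\Phi(t_\nu - \mu_l) + \Phi(-t_\nu - \mu_l)]$$

as a function of $\nu$, considered as a real variable in $\mathbb{R}_+$. Writing $\dot M$, $\ddot M$ for partial derivatives
w.r.t. $\nu$, calculus shows that

$$dt_\nu/d\nu = -q_n/(2n\phi(t_\nu)),$$

$$\dot M(\nu;\mu) = (-dt_\nu/d\nu)\sum_l [\phi(t_\nu - \mu_l) + \phi(t_\nu + \mu_l)]$$ (6.14)

$$= (q_n/n)\sum_l e^{-\mu_l^2/2}\cosh(t_\nu\mu_l) > 0,$$ (6.15)

$$\ddot M(\nu;\mu) = -\frac{q_n^2}{2n^2\phi(t_\nu)}\sum_l \mu_l\,e^{-\mu_l^2/2}\sinh(t_\nu\mu_l) \le 0,$$ (6.16)

with strict inequality unless $\mu = 0$. Finally, since $M(0;\mu) = 0$, there exists $\tilde\nu \in [0, \nu]$ such
that the threshold exceedance function $\nu^{-1}M(\nu;\mu) = \nu^{-1}[M(\nu;\mu) - M(0;\mu)] = \dot M(\tilde\nu;\mu)$,
and hence, for each $\mu$, the exceedance function is decreasing in $\nu$:

$$\frac{\partial}{\partial\nu}\Big(\frac{M(\nu;\mu)}{\nu}\Big) = \frac{1}{\nu}[\dot M(\nu;\mu) - \dot M(\tilde\nu;\mu)] \le 0.$$ (6.17)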
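The closed form for the slope, $\dot M(\nu;\mu) = (q_n/n)\sum_l e^{-\mu_l^2/2}\cosh(t_\nu\mu_l)$, can be compared against a central finite difference of $M(\nu;\mu)$. This is a sketch under the assumption that $t_\nu$ solves $2\tilde\Phi(t_\nu) = q\nu/n$ with $\nu$ treated as a real variable; the mean vector and step size are arbitrary illustrative choices.

```python
from math import cosh, exp
from statistics import NormalDist

STD = NormalDist()


def threshold(nu, n, q):
    """t_nu, assumed to solve 2 * (1 - Phi(t)) = q * nu / n for real nu > 0."""
    return STD.inv_cdf(1.0 - q * nu / (2.0 * n))


def M(nu, mus, q):
    """Mean detection function: sum over l of P(|Z + mu_l| > t_nu)."""
    t = threshold(nu, len(mus), q)
    return sum((1.0 - STD.cdf(t - m)) + STD.cdf(-t - m) for m in mus)


def M_dot(nu, mus, q):
    """Closed-form derivative in nu: (q/n) * sum exp(-mu^2/2) cosh(t_nu mu)."""
    n = len(mus)
    t = threshold(nu, n, q)
    return (q / n) * sum(exp(-m * m / 2.0) * cosh(t * m) for m in mus)


if __name__ == "__main__":
    mus = [3.0] * 10 + [0.0] * 990   # hypothetical sparse mean vector
    q, nu, h = 0.2, 20.0, 1e-3
    finite_diff = (M(nu + h, mus, q) - M(nu - h, mus, q)) / (2.0 * h)
    print("finite difference:", finite_diff)
    print("closed form      :", M_dot(nu, mus, q))
    # The slope should decrease in nu, reflecting concavity of M.
    print("slope at 30 < slope at 20:", M_dot(30.0, mus, q) < M_dot(20.0, mus, q))
```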
Let us focus now on $\ell_0[\eta_n]$. In this case

$$M(\nu;\mu) = \sum_{l=1}^{k_n}[\tilde\Phi(t_\nu - \mu_l) + \Phi(-t_\nu - \mu_l)] + (1 - \eta_n)\,q_n\nu,$$

and so, using (6.14) and (12.10),

$$\dot M(\nu;\mu) \le \frac{2\phi(0)\,k_n}{\nu\,t_\nu} + (1 - \eta_n)\,q_n.$$

In particular, if $\nu = ak_n$, then

$$\dot M(ak_n;\mu) \le \frac{2\phi(0)}{a\,t[ak_n]} + q_n.$$ (6.18)

Finally, if $a = \alpha_n = 1/(b_4\tau_\eta)$, then (12.16) shows that for $\eta_n$ sufficiently small, $2\phi(0)/(\alpha_n\,t[\alpha_nk_n]) \le b_4$.
As a result, uniformly in $\ell_0[\eta_n]$,

$$\dot M(\alpha_nk_n;\mu) \le b_4 + q_n \le q',$$ (6.19)
by the definition of $b_4$; recall assumptions (A) and (Q) of Section 4.1. □

6.4 Weak $\ell_p$: Bounds for the detection function

For weak $\ell_p$, we do not have such a simple bound on $\dot M$ as (6.19). From the preceding
calculations, we know that $\dot M(\nu;\mu)$ is positive and decreasing. We will need now some
sharper estimates, uniform over $m_p[\eta_n]$ (and $\ell_0[\eta_n]$) in the scaling $\nu = ak_n$, with $a$ regarded
as variable. This will lead to bounds on the solution of $M(ak_n;\mu) = ak_n$ and hence to
bounds on $k(\mu)$ (cf. Corollary 6.8).
The two key phenomena:

(a) If $a_1$ is fixed, then for $\nu$ in intervals $[a_1k_n, a_1^{-1}k_n]$, the slope of $M$ is, for large $n$,
essentially constant and equal to $q_n$. This reflects exclusively the effect of false detections.

(b) For small $a$ ($\sim 1/\tau_\eta$, say), the order of magnitude of $\dot M(ak_n;\mu)$ can be as large as
$1/(a\,t[ak_n])$. This reflects essentially the effect of true detections.

Since $\mu \to M(k;\mu)$ is an even function of each $\mu_l$, we may assume without loss of
generality that $\mu_l \ge 0$ for each $l$.

To bound $\dot M$, divide the range of summation into three regions, defined by the indices

$$k_n = n\eta_n^p\tau_\eta^{-p} \quad\text{and}\quad k'_n = n\eta_n^p\tau_\eta^{p}.$$

Thus, we write

$$M = M_{pos} + M_{trn} + M_{neg},$$

where the sum in $M_{pos}$ extends over the range $[1, k_n]$ of true "positives". The sum in $M_{trn}$
ranges over $(k_n, k'_n]$ and is "transitional", while the sum for $M_{neg}$ ranges over $(k'_n, n]$ and
corresponds to means that are essentially true "negatives".

A rough statement of the results to follow is that for $a$ in the range $[\gamma\tau_\eta^{-1}, 1]$,

$$\sup_{m_p[\eta_n]}\dot M_{pos}(ak_n;\mu) \asymp \frac{1}{a\,t[ak_n]},$$
$$\sup_{m_p[\eta_n]}\dot M_{trn}(ak_n;\mu) = O\Big(\frac{1}{a\,t^2[ak_n]}\Big),$$
$$\dot M_{neg}(ak_n;\mu) \sim q_n.$$

Combining these will establish:

Proposition 6.6. Assume (Q), (H) and (A). For n > n{b), > CQ,a > 1 and all /i G 

mp[7]n], 

qn[l - < Miakn, fi) < g„[l + c(6)5p(6„)] + [l + . (6.20) 
If, in addition, $a \ge \gamma\tau_\eta^{-1}$, then for $\eta < \eta(\gamma, p, b_1, b_3)$ sufficiently small

$$\sup_{m_p[\eta_n]}\dot M(ak_n;\mu) \ge q_n[1 - \epsilon_n] + \frac{c_0}{a\,t[ak_n]}.$$ (6.21)

The proof consists in building estimates for $\dot M$ in the positive, transition and negative
zones. It will also be convenient, for Corollary 6.8 below, to obtain estimates at the same
time for the corresponding components of the detection function $M$ itself.

6.4.1 Positive zone 

Mpos{v] /i) = X]f=i Pi^il^l) for fi = p,i, approximately constant on the interval [aikn, a^^kn]: 
for ai < a < aj~^, 

Mposiakn, fl) G [1 - ein, l]/cn. (6.22) 

Proof. The upper bound follows from Pu{l^i) ^ 1- For the lower bound, since ^{t^ — fli) is 
decreasing in I and increasing in 1^, we have for any li < kn, 

Mpos{v] fl)>Y^ Ht^ - fii) > IMty - fli) > li^t[6k 



Choose 'Jn = Tr]^ ^^'^ define li by the equation 

fii^ = t[5kn] + zi'jn). 
From (|12.16)) it is clear that fn^ > so that h < kn. Hence 

Mpos{v;fi) > ll^{-z{jn)) = /l(l -7n). 



We have from (|12.17j) that = + ^2 log + c(6i, 63) for rj small, and hence 



h 



t[5kn] + 2;(7n) 



> 



^ a/2 log + • 



> l-cr-iy2bi^ 



Since 7„ = the last two displays imply (|6.22j) with ei„ = cr^ ^Y^21ogT^. 
Bound for Mpos- Since Mpos{v; /i) = YX" ^{'^v - w) + ^{-tv - w), we have 



Mposiy, IJ) = {-dt^/du) ^ <i){ti, - m) + <i){ty + m) 



(6.23) 



□ 



(6.24) 



From (12.10), we obtain $\dot M_{pos}(\nu;\mu) \le 2k_n\phi(0)/(\nu t_\nu)$. Hence, for $\nu = ak_n$ with $t_\nu \ge 1$ and
all $\mu$,

$$\dot M_{pos}(ak_n;\mu) \le \frac{2\phi(0)}{a\,t[ak_n]}.$$ (6.25)

Turning to the lower bound, we note that fii > t^, if and only if / < nr^tZ^ =: ly. By setting 
all = ty for I <ly,we find from ^(^l^i and 1)12. lUj) that 

sup Mpos{i^; fJ,) > — 7 — . 
(This also holds for $\ell_0[\eta_n]$.) If

u — fl^ji) then 

A /ly 1 



utv at[akn] 



t[akn\ 



Mtrn{akn]^j) < J\ (6.28) 



If a > 7Tj^ ^, then for r] sufficiently small, (112. 18() says T^/t[aA;„] > ^. Combining the last 
remarks, we conclude that for u = akn and a > 7t~^, then for rj sufficiently small, 

sup Mposiakn, fJ.) > J° y (6.26) 

mp[r]„] at[aKn\ 

Here cq denotes an absolute constant. 
6.4.2 Transition zone 

For a < a^^ and t] < rjQ sufficiently small, we have, uniformly in mp[r]n]: 

<Mtrniakn;fJ.) < CQT'^kn and (6.27) 

c 

at^ [akn 

To bound Mtrn{k; /i), introduce 

/i(/i,t) = e*^-^'/2 

which increases from 1 to a global maximum of e*'/2 = 0(O)/0(t) as /i grows from to 
t. We have from (|H.15|) that M{k;n) < 2[qn/n)Y^h{ixi,tk). The arguments for the two 
bounds run in parallel. We have 

k' h' 

Mtrn{i^;fJ') < y^^Hiim) and Mtrn{v;ij) < (gn/n) y^ff2(w), 

where Hi{fi) = 2'^{ty — ^) and H2{fi) = h{fiAt^,t^) are both increasing functions of /i > 0. 
By integral approximation, 

k' k' 

^ i^(w) < E^i(w) < ^r?^ y F(^.)n-^'-id^. 

For a < a^^ , we have t = t,, > - 3/2 for ?? sufficiently smah by (I12.1(i|l . while (|12.18|1 
shows that r^"^ > l/(2t^). Let F = sup H; we have 



Jl/{2t) 



after using Lemma 16.71 below. For Mtm, H = 2 and we have from the previous displays 

since t[akn]^^ < 2t^^ . 

For Mtrn, H = h{tu,ty) = (p{0)/(p{U), and since <p{U) > 

and so dlT^ follows from (ll:^.16|) . 
Lemma 6.7. For $p \le 2$ and $t \ge 2$, and for $h(u,t)$ given by either $\tilde\Phi(t - u)$ or $\phi(t - u)/\phi(t)$,
there is an absolute constant $c_0$ such that

$$\int_{1/(2t)}^{t}\frac{h(u,t)\,du}{u^{p+1}} \le \frac{c_0\,h(t,t)}{t^{p+1}}.$$

Proof. Writing $v$ for $t - u$, we find that in the two cases $h(u,t)/h(t,t)$ equals $2\tilde\Phi(v)$ or $e^{-v^2/2}$
respectively. By a standard Gaussian tail bound, $2\tilde\Phi(v) \le e^{-v^2/2}$ for $v \ge 0$, so in either case the ratio of the integral in question
to $h(t,t)/t^{p+1}$ is bounded by

$$\int_0^{t - 1/(2t)}\exp\{-v^2/2 + (p + 1)g(v)\}\,dv,$$

where the convex function $g(v) = \log t - \log(t - v)$ is bounded for $0 \le v \le t - 1$ by $4v(\log t)/t$.
Completing the square in the exponent gives an integrand smaller than $\sqrt{2\pi}\exp\{\tfrac12\mu^2(p,t)\}$ times a unit-
variance Gaussian density centered at $\mu(p,t) = 4(p + 1)(\log t)/t$. Since $\mu(p,t) \le c_0$ for $p \le 2$
and $t \ge 2$, the previous integral is bounded by $\sqrt{2\pi}\exp\{\tfrac12\mu^2(p,t)\} \le c_0$. □




6.4.3 Negative zone

Under conditions described immediately below,

$$[1 - \epsilon_n]\,q_n\nu \le M_{neg}(\nu;\mu) \le [1 + c\,\delta_p(\epsilon_n)]\,q_n\nu,$$ (6.29)
$$[1 - \epsilon_n]\,q_n \le \dot M_{neg}(\nu;\mu) \le [1 + c\,\delta_p(\epsilon_n)]\,q_n.$$ (6.30)

The lower bound in (6.29) holds for all $n, \nu, \mu$. All other bounds require $\mu \in m_p[\eta_n]$ and
$n > n(b)$. The upper bounds further require $\nu$ such that $t_\nu \ge c(b)$.

Lower Bounds. For $M_{neg}(\nu;\mu) = \sum_{l > k'_n} p_\nu(\mu_l)$ this is simple because $p_\nu(\mu_l) \ge p_\nu(0) =
q_n\nu/n$, so that $M_{neg}(\nu;\mu) \ge (n - k'_n)\,q_n\nu/n = [1 - \epsilon_n]\,q_n\nu$.

For $\dot M_{neg}(\nu;\mu)$, we set $f(\mu,t) = e^{-\mu^2/2}\cosh(t\mu)$ and check that for given $c_1$, $\mu \to f(\mu,t)$
is increasing for $t\mu \le c_1$ and $t \ge \sqrt{c_1}$. For $l > k'_n$ we have

$$\mu_l \le \bar\mu_l \le \bar\mu_{k'_n} = \tau_\eta^{-1} \le c_1t_1^{-1} \le c_1t_\nu^{-1},$$ (6.31)
by (12.15), for $c_1 = c_1(b)$. Consequently $f(\mu_l, t_\nu) \ge f(0, t_\nu) = 1$ and so

$$\dot M_{neg}(\nu;\mu) = (q_n/n)\sum_{l > k'_n} f(\mu_l, t_\nu) \ge (q_n/n)(n - k'_n) = [1 - \epsilon_n]\,q_n.$$

Upper Bounds. The arguments run in parallel: we have

$$M_{neg}(\nu;\mu) \le \sum_{l > k'_n} H_1(\mu_l) \quad\text{and}\quad \dot M_{neg}(\nu;\mu) \le (q_n/n)\sum_{l > k'_n} H_2(\mu_l),$$ (6.32)

where $H_1(\mu) = 2\tilde\Phi(t_\nu - \mu)$ and $H_2(\mu) = h(\mu, t_\nu)$ are both increasing and convex functions
of $\mu \in [0, 1]$, at least when $t \ge 2$. Using (6.31) along with $t = t_\nu$, this convexity implies

$$H(\mu_l) \le (1 - t\mu_l/c_1)\,H(0) + (t\mu_l/c_1)\,H(c_1/t)$$

and hence, since $t \le c_b\tau_\eta$ from (12.15),

$$n^{-1}\sum_{l > k'_n} H(\mu_l) \le H(0) + c_bc_1^{-1}H(c_1/t)\cdot\tau_\eta n^{-1}\sum_{l > k'_n}\mu_l.$$ (6.33)
By an integral approximation, since $\epsilon_n^{1/p} = \eta_n\tau_\eta$ and $k'_n/n = \epsilon_n$, we find

$$\tau_\eta n^{-1}\sum_{l > k'_n}\mu_l \le \tau_\eta\eta_n\int_{k'_n/n}^{1} x^{-1/p}\,dx = p\,\epsilon_n\int_{\epsilon_n^{1/p}}^{1} w^{-p}\,dw = \delta_p(\epsilon_n).$$ (6.34)

To apply these bounds to M„eg(z/, /x), we note that ^^^i(O) = qnv/n while from (|12.4I) . for 

t > 

From jnSl, and (jlT^ni) . we obtain (IfT^ . 

Turning to $\dot M_{neg}(\nu;\mu)$, we note that $H_2(0) = 1$ and $H_2(c_1/t) = \exp\{c_1 - c_1^2/(2t^2)\} \le e^{c_1}$
and so the same bounds combine to yield (6.30).

6.4.4 Conclusion 

The upper bound in (6.20) follows by combining those in (6.25), (6.28) and (6.30). The
lower bound (6.21) follows by combining (6.26) and (6.30).

Corollary 6.8. Let $d_n = 2c_0\tau_\eta^{-1}$ (where $c_0$ is the constant in (6.27)). Uniformly in $m_p[\eta_n]$,

$$k(\mu) \le (1 - q_n - d_n)^{-1}k_n.$$ (6.35)

Proof. Let $s = (1 - q_n - d_n)^{-1}$. Combining the bounds on $M_{pos}$, $M_{trn}$ and $M_{neg}$ in (6.22),
(6.27) and (6.29), we find for $n > n(b)$ and $\eta$ sufficiently small that

$$M(sk_n;\mu) \le [1 + c_0\tau_\eta^{-1} + q_ns(1 + c_b\,\delta_p(\epsilon_n))]\,k_n \le [1 + q_ns + r_n]\,k_n,$$

where, since $s \le (1 - q')^{-1}$,

$$r_n = c_0\tau_\eta^{-1} + c\,\delta_p(\epsilon_n), \qquad c = c(b, q').$$

We have

$$M(sk_n;\mu)/(sk_n) \le 1 - q_n - d_n + q_n + r_n = 1 - d_n + r_n.$$

Since $\delta_p(\epsilon_n) = o(\tau_\eta^{-1})$ (from the assumptions on $\eta_n$), clearly $r_n - d_n < 0$ for $n > n(b)$; for
such $n$, $M(sk_n;\mu) < sk_n$, and so $k(\mu) < sk_n$, as required. □

We draw a consequence for later use. Define

$$\kappa_n = \begin{cases} [\alpha_n + (1 - q_n)^{-1}]\,k_n & \text{for } \ell_0[\eta_n],\\ [\alpha_n + (1 - q_n - d_n)^{-1}]\,k_n & \text{for } m_p[\eta_n]. \end{cases}$$ (6.36)

Recall now the notational assumption (A). Clearly, for large $n$,

$$\kappa_n \sim (1 - q_n)^{-1}k_n, \quad\text{and}\quad \kappa_n \le k_n/q''.$$ (6.37)

From the remark after (5.6) (case $\Theta_n = \ell_0[\eta_n]$), and from Corollary 6.8 (case $\Theta_n = m_p[\eta_n]$),

$$\sup_{\mu\in\Theta_n} k_+(\mu) \le \kappa_n.$$
7 Large Deviation bounds for $[\hat k_G, \hat k_F]$



We now develop exponential bounds on the FDR interval $[\hat k_G, \hat k_F]$ that lead to a proof of
Proposition 5.1. 'Switching' inequalities allow the boundary-crossing definitions of $\hat k_G$, $\hat k_F$ to
be expressed in terms of sums of independent Bernoulli variables for which large deviation
inequalities in a 'small numbers' regime can be applied.

7.1 Switching Inequalities 

We will write $Y_l$ for the absolute ordered values $|y|_{(l)}$. Let $1 \le \hat k_G \le \hat k_F \le n$ be respectively

the smallest and largest local minima of $k \to S_k = \sum_{l=k+1}^{n} Y_l^2 + \sum_{l=1}^{k} t_l^2$ for $0 \le k \le n$. The
possibility of ties in the sequence $\{S_k\}$ complicates the exact description of local minima.
Since ties occur with probability zero, we will for convenience ignore this possibility in the
arguments to follow, lazily omitting explicit mention of "with probability one".
Define the exceedance numbers

$$N(t_k) = \#\{i : |y_i| \ge t_k\}, \qquad N_+(t_k) = \#\{i : |y_i| > t_k\}.$$

Clearly $N(t_k)$ and $N_+(t_k)$ have the same distributions. We now have

$$\hat k_F = \max\{l : Y_l \ge t_l\} = \max\{l : N(t_l) \ge l\}$$ (7.1)

$$\hat k_G + 1 = \min\{l : Y_l < t_l\} = \min\{l : N_+(t_l) < l\}.$$ (7.2)

[We set $\hat k_F = 0$ or $\hat k_G = n$ if no such indices $l$ exist.] To verify the left hand equalities,
note that $S_k - S_{k-1} = t_k^2 - Y_k^2$, so that

$$S_k \ge S_{k-1} \iff Y_k \le t_k.$$

The largest local minimum of $S_k$ occurs at $k = \hat k_F$ exactly when $S_k < S_{k-1}$ but $S_l \ge S_{l-1}$
for all $l > k$. In other words $Y_k > t_k$ but $Y_l \le t_l$ for all larger $l$, which is precisely (7.1).
Similarly, the smallest local minimum of $S_k$ occurs at $k = \hat k_G$ exactly when $S_{k+1} > S_k$ but
$S_l < S_{l-1}$ for all $l \le k$, and this leads immediately to (7.2).
For the right-hand equalities, we simply note that

$$N(t_k) \ge k \text{ iff } Y_k \ge t_k, \quad\text{and}\quad N_+(t_k) < k \text{ iff } Y_k \le t_k.$$
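The switching description can be illustrated on simulated data. The sketch below assumes thresholds defined by $2\tilde\Phi(t_k) = qk/n$ and the penalized sum of squares $S_k = \sum_{l>k} Y_l^2 + \sum_{l\le k} t_l^2$; it computes the two indices from the threshold comparisons and, separately, from the local minima of $S_k$ (detected through the increments $S_k - S_{k-1} = t_k^2 - Y_k^2$). All names and the simulated signal are illustrative assumptions.

```python
import random
from statistics import NormalDist

STD = NormalDist()


def thresholds(n, q):
    """t_1 >= ... >= t_n with 2 * (1 - Phi(t_k)) = q * k / n (assumed form)."""
    return [STD.inv_cdf(1.0 - q * k / (2.0 * n)) for k in range(1, n + 1)]


def fdr_indices(y, q):
    """Return (k_F, k_G, Y, t) using the threshold-crossing descriptions."""
    n = len(y)
    Y = sorted((abs(v) for v in y), reverse=True)  # Y[l-1] is the l-th largest |y|
    t = thresholds(n, q)
    above = [l for l in range(1, n + 1) if Y[l - 1] >= t[l - 1]]
    below = [l for l in range(1, n + 1) if Y[l - 1] < t[l - 1]]
    k_F = max(above) if above else 0
    k_G = (min(below) - 1) if below else n
    return k_F, k_G, Y, t


def local_minima_of_S(Y, t):
    """Indices k in 0..n that are strict local minima of S_k, via its increments."""
    n = len(Y)
    d = [t[k] ** 2 - Y[k] ** 2 for k in range(n)]  # d[k-1] = S_k - S_{k-1}
    return [k for k in range(n + 1)
            if (k == 0 or d[k - 1] < 0) and (k == n or d[k] > 0)]


if __name__ == "__main__":
    random.seed(0)
    y = [4.0 + random.gauss(0, 1) for _ in range(10)]   # hypothetical signal
    y += [random.gauss(0, 1) for _ in range(190)]       # pure noise
    k_F, k_G, Y, t = fdr_indices(y, q=0.1)
    mins = local_minima_of_S(Y, t)
    print("k_G, k_F:", k_G, k_F)
    print("smallest / largest local minimum of S:", mins[0], mins[-1])
```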



7.2 Exponential Bounds 

First, recall Bennett's exponential inequality (e.g. Pollard (1984, p. 192)) in the form
which states that for independent, zero mean random variables $X_1, \dots, X_n$ with $|X_i| \le K$
and $V = \sum\mathrm{Var}(X_i)$,

$$P\{X_1 + \dots + X_n \ge \eta\} \le \exp\Big\{-\frac{\eta^2}{2V}B\Big(\frac{K\eta}{V}\Big)\Big\},$$

where $B(\lambda) = (2/\lambda^2)[(1 + \lambda)\log(1 + \lambda) - \lambda]$ for $\lambda > 0$ is decreasing in $\lambda$. We adapt this for
settings of Poisson approximation.

Lemma 7.1. Suppose that $Y_l$, $l = 1, \dots, n$ are independent 0/1 variables with $P(Y_l = 1) =
p_l$. Let $N = \sum_1^n Y_l$ and $M = EN = \sum p_l$. Then

$$P\{N \le k\} \le \exp\{-\tfrac14Mh(k/M)\} \quad\text{if } k < M,$$ (7.3)

$$P\{N \ge k\} \le \exp\{-\tfrac14Mh(k/M)\} \quad\text{if } k > M,$$ (7.4)

where $h(x) = \min\{|x - 1|, |x - 1|^2\}$.
Proof. In Bennett's inequality, if $M > k$, set $X_l = p_l - Y_l$ and $\eta = M - k$. If $M < k$, put
$X_l = Y_l - p_l$ and $\eta = k - M$. In each case $K = 1$, and $V = \mathrm{Var}\,N = \sum p_l(1 - p_l)$, and we

write $E = (\eta^2/V)B(\eta/V)$ for the term in the exponent, so that the bound reads $\exp\{-E/2\}$.

Suppose first $\eta/V \le 1$. Since $B$ is decreasing, $B(\eta/V) \ge B(1)$, and since $V \le M$, we
have $\eta^2/V \ge \eta^2/M$. Thus $E \ge B(1)\eta^2/M$.

Now if $\eta/V \ge 1$, we note from some calculus that the function $C(\lambda) = \lambda B(\lambda)$ is
increasing for $\lambda \ge 1$. Hence $(\eta/V)B(\eta/V) \ge B(1) \ge B(1)\min(1, \eta/M)$. Thus $E \ge
B(1)\min(\eta^2/M, \eta)$.

Now $B(1) = 2[2\log 2 - 1] = .77 > 1/2$. So, in either case $E \ge \tfrac12\min(\eta^2/M, \eta) =
(M/2)h(\eta/M)$ and $\eta/M = |(k/M) - 1|$. □
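As a sanity check on the form of these bounds, one can compare $\exp\{-\tfrac14 M h(k/M)\}$ with exact tail probabilities in the simplest case of identical success probabilities, where $N$ is binomial. This is an illustrative sketch only: the constant $1/4$ is taken from the reading of the proof above, and the binomial setting and parameter values are assumptions chosen for convenience.

```python
from math import comb, exp


def h(x):
    """h(x) = min(|x - 1|, |x - 1|^2)."""
    d = abs(x - 1.0)
    return min(d, d * d)


def tail_bound(k, M):
    """Bound exp(-M h(k/M) / 4) used for both tails."""
    return exp(-0.25 * M * h(k / M))


def binom_pmf(n, p, j):
    return comb(n, j) * p ** j * (1.0 - p) ** (n - j)


def binom_upper_tail(n, p, k):
    """Exact P(N >= k) for N ~ Binomial(n, p)."""
    return sum(binom_pmf(n, p, j) for j in range(k, n + 1))


def binom_lower_tail(n, p, k):
    """Exact P(N <= k) for N ~ Binomial(n, p)."""
    return sum(binom_pmf(n, p, j) for j in range(0, k + 1))


if __name__ == "__main__":
    n, p = 500, 0.02          # 'small numbers' regime: M = n p = 10
    M = n * p
    for k in (2, 5, 15, 20, 30):
        exact = binom_lower_tail(n, p, k) if k < M else binom_upper_tail(n, p, k)
        print(k, exact, tail_bound(k, M))
```

The bound is far from tight, which is all the later arguments need: it only has to decay fast enough to survive a union over at most $n$ indices.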

7.3 Bounds on k/Mk 

The Lipschitz properties of $k \to M(k;\mu)$ established in Section 6 are now applied to derive
bounds on the ratios $k/M_k$ appearing in the exponential bounds (7.3)-(7.4). In the following,
we use $b$ to denote the vector of constants $b = (b_1, b_4, q')$, and the phrase $n > n_0(b)$ to
indicate that a statement holds for $n$ sufficiently large, depending on the constants $b$.

Proposition 7.2. Assume hypotheses (Q), (H) and (A). If $\alpha_nk_n \le k_1$, then for $n > n(b)$,
uniformly in $\mu \in \ell_0[\eta_n]$ and $m_p[\eta_n]$,

$$M(k_1 + \alpha_nk_n;\mu) - M(k_1;\mu) \le q'\alpha_nk_n.$$ (7.5)

(a) If k{pL) < fcl < (1 — q')^^kn and ^2 = ^1 + oinkn, then 

^2 -, ^ 



1> (l-g')'an=:a„. (7.6) 

(b) There is C, > Q so that, if 2ankn < ^(m) ^ (1 ~ Q')^^kn and ki = k{fi) — ankn, then 

1-^>{1- q'fan > Can. (7.7) 

Proof. Formulas (6.15)-(6.16) show that $\dot M(\nu;\mu)$ is positive and decreasing and so

the left side of (7.5) is positive and bounded above by $\alpha_nk_n\dot M(\alpha_nk_n;\mu)$. For $\mu \in \ell_0[\eta_n]$, (6.19)
shows that for $n > n(b)$, $\dot M(\alpha_nk_n;\mu) \le q'$ for all $\mu$ in $\ell_0[\eta_n]$. That the same bound holds
uniformly over $m_p[\eta_n]$ also is a consequence of (6.20) and (6.18).

(a) To prove (|7.6j) . note that the assumption k{fi) < ki entails M^^ < fci, so from (|7.5|) . 

Mfc2 = Mk, + Mfc2 - Mk, <ki+ q'ankn. 

Since ki < kn/ (1 — q') and q' < 1, we have 

^ ki + g'«nfcn ^ 1 + g'(l - g')«n 



^2 h + ankn 1 + (1 - q')an 
Thus, since = 0(l/-v/log n), for n > n(6), we find 



A/fa + 

(b) The assumption that /c(/i) > 2a„/c„ yields /ci > a„/cn, and so Lipschitz bound ()7.5|) 
implies M^^^) — M^^ < q'ankn. Hence, since k{ij,) < kn/{l — q'), 

^ fc(/i) - q'ankn ^ 1 - q'{l - q')an 
ki ~ k{ii) - ankn ~ 1 - (1 - q')an 
which leads to (|7.7j) by simple rewriting. □ 
7.4 Proof of Proposition 5.1

1°. Let $k_1 = k(\mu) \vee \alpha_nk_n$ and $k_2 = k_1 + \alpha_nk_n$. From (7.1),

$$\{\hat k_F \ge k_2\} \subset \bigcup_{l \ge k_2}\{N(t_l) \ge l\}.$$ (7.8)

For $l \ge k_2 > k(\mu)$, we necessarily have $E_\mu N(t_l) = M(l;\mu) < l$, and so from Lemma 7.1

$$P_\mu\{N(t_l) \ge l\} \le \exp\{-\tfrac14M_l\,h(l/M_l)\}, \qquad M_l = M(l;\mu).$$ (7.9)

For X > 1, the function h{x) is increasing, and for I > k2, I ^ I /Mi is increasing and so 
h{l/Mi) > h{k2/Mk^). Now ki and k2 satisfy the assumptions of Proposition I7.2f a) and so 
from (|7.6j) . h{k2/Mk2) > C'^c^n- Since / — > is increasing, we have from Proposition 16.41 
that 

Ml > M(l; fi) > c(log ny-^/^. (7.10) 
Combining (fTH]) . fT^ and (fHTH) . we find 

i'^i^F > A;2} < X] exp{-iMiC'a^} 

< nexp{— ca^(logn)'''~^/^} 

< nexp{-c'(logn)^-^/2}^ (7 ;^^) 

for c' depending on q' and 64. 

2°. Now assume that A;(/i) > 2ankn', we estabUsh a high probabiUty lower bound for 
kc- Let ki = k{fj.) — Onkn] from H7.2() 

{A:g < fci} = {A:g + 1 < A:i} C [j {N+{ti) < I}. 

l<ki 

For I <ki < k{fi), we necessarily have Mi > I and so 

P{N+{ti) <l} = P{N{ti) <l}< exp{-lMih(l/Mi)}. 

Since / I /Mi < 1 is increasing, and since ki and k{ij) satisfy the assumptions of Propo- 
sition [7]2tb), we obtain from (|7.7|) that 

MW)>(l--|^)^C^ai 

In addition / Mi is increasing, and so Mi > Mi. Since k{^) > 2ankn > ctnkn, we have 
from Proposition 16.41 that 1)7. 1U() holds here also. Hence 

Pi.{kG < ki} < kieM-lMiC^al} < n exp{-c'(log n)^-5/2|^ 
in the same way as for 1)7. 

8 Lemmas on Thresholding 

This section collects some preparatory results on hard (and in some cases soft) thresholding 
with both fixed and data dependent thresholds. These are useful for the analysis and 
comparison of the various FDR and penalized rules, and are perhaps of some independent 
utility. 
8.1 Fixed Thresholds

First, an elementary decomposition of the $\ell_r$ risk of hard thresholding.
Lemma 8.1. Suppose that $x \sim N(\mu, 1)$ and that $\eta_H(x,t) = xI\{|x| > t\}$. Then

$$\rho_H(t,\mu) = E|\eta_H(x,t) - \mu|^r = \int_{-t}^{t}|\mu|^r\phi(x - \mu)\,dx + \int_{|x|>t}|x - \mu|^r\phi(x - \mu)\,dx$$ (8.1)

$$= D(\mu,t) + E(\mu,t),$$ (8.2)

where

$$D(\mu,t) = |\mu|^r[\Phi(t - \mu) - \Phi(-t - \mu)]$$ (8.3)

$$E(\mu,t) = |t - \mu|^{r-1}\phi(t - \mu) + |t + \mu|^{r-1}\phi(t + \mu) + \epsilon(t - \mu) + \epsilon(t + \mu)$$ (8.4)

$$|\epsilon(v)| = |r - 1|\int_v^\infty z^{r-2}\phi(z)\,dz \le v^{r-3}\phi(v), \qquad v > 0,\ 0 < r \le 2.$$ (8.5)

We note that for $0 < r \le 2$

$$\int_t^\infty z^r\phi(z)\,dz = t^r\tilde\Phi(t)(1 + \theta t^{-2}), \qquad 0 \le \theta \le r,$$ (8.6)

and that $E(\mu,t)$ is (i) globally bounded: $0 \le E(\mu,t) \le c_r = \int|z|^r\phi(z)\,dz$,
(ii) increasing in $\mu$, at least for $0 \le \mu \le t - \sqrt2$;
(iii) satisfies $E(1,t) \le c_0t^r\tilde\Phi(t - 1)$ for $t > 1$.

A consequence of (8.6) is that for some $|\theta_2| \le 1$,

$$\sum_{l=1}^{k} t_l^r = kt_k^r(1 + \theta t_k^{-2}) + \theta_2t_1^r = kt_k^r(1 + o(1)),$$ (8.7)

so long as $k \to \infty$ and $k/n \to 0$.

Proof. The risk decomposition is immediate. For 1)8. 6|) . by partial integration 



oo 

r-l, 



z''(i){z)dz = f^t) +r J z''-'<P{z)dz, 

from which the lower bound is clear. For the upper bound, use p2.2|) and r < 2 to find 
that the second integral is bounded by 

oo 

z''^^(l){z)dz < f-^^{t). 
Property (i) of E is evident, while for < /i < t, one finds that 

dE/dfi = {t- /u)Xt -ti)-{t + /i)Xt + n), 



and the latter difference is positive if {t — fi) > 2. For property (iii) we note that E{1, t) < 
2 z'^(j){z)dz and then appeal to (|8.6() . 

Turning to (|8.7|) . by integral approximation, we have for < ^2 < Ij 

^fl=Y^ z''{ql/2n) = {2n/q) / v'-^{v)dv + 02f^. 
1 1 -^tk 

From ()8.6() . the right hand integral term equals A:t^ ~ + ^kt^^'^ from which the result 
follows. □ 
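The tail-moment identity used in this proof can be checked by quadrature. The sketch below assumes the identity reads $\int_t^\infty z^r\phi(z)\,dz = t^r\tilde\Phi(t)(1 + \theta t^{-2})$ with $0 \le \theta \le r$ for $0 < r \le 2$ (the reading adopted here from the partial-integration step), and solves for the implied $\theta$. The Simpson rule, the truncation of the integral at $t + 12$, and all names are choices of this sketch.

```python
from statistics import NormalDist

STD = NormalDist()


def simpson(f, a, b, m=4000):
    """Composite Simpson rule on [a, b] with an even number m of subintervals."""
    step = (b - a) / m
    total = f(a) + f(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * f(a + i * step)
    return total * step / 3.0


def tail_moment(t, r):
    """Numerical value of the integral of z^r phi(z) over [t, t + 12]."""
    return simpson(lambda z: z ** r * STD.pdf(z), t, t + 12.0)


def implied_theta(t, r):
    """Solve tail_moment = t^r * (1 - Phi(t)) * (1 + theta / t^2) for theta."""
    base = t ** r * (1.0 - STD.cdf(t))
    return (tail_moment(t, r) / base - 1.0) * t * t


if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0):
        for t in (2.0, 3.0, 4.0):
            print(r, t, implied_theta(t, r))  # expected to lie in [0, r]
```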
The next lemma, on covariance between the data and hard thresholding, is simpler in
the $\ell_2$ case. The $\ell_r$ analogs are postponed to Section 11.3.

Lemma 8.2. Let $x \sim N(\mu, 1)$. $\xi(t,\mu) = E_\mu(x - \mu)[\eta_H(x,t) - \mu]$ has the properties

(i) $\xi(t,\mu) = t[\phi(t - \mu) + \phi(t + \mu)] + \tilde\Phi(t - \mu) + \Phi(-t - \mu)$. (8.8)

(ii) $\xi(t,\mu) \le 2$ if $|\mu| \le t - \sqrt{2\log t}$, and $\xi(t,\mu) \le t + 1$ for all $\mu$. (8.9)

(iii) $\mu \to \xi(t,\mu)$ is symmetric about 0, increasing for $0 \le \mu \le t$, (8.10)
and convex for $0 \le \mu \le t/3$, if $t \ge 3$. (8.11)

(iv) $\sup_{|\mu| \le t/3}|\xi_{\mu\mu}(t,\mu)| \le c_0$. (8.12)



Proof. Formula (8.8) follows by direct evaluation, and inspection shows that $\xi(t,\mu)$ is sym-
metric about $\mu = 0$. The global bound in (8.9) follows since $2\phi(0) < 1$. If $0 \le \mu \le
t - \sqrt{2\log t}$, then

$$\xi(t,\mu) \le 2t\phi(t - \mu) + 1 \le 2t\phi(\sqrt{2\log t}) + 1 = 2\phi(0) + 1 \le 2.$$

For the monotonicity, writing $\xi_\mu$ for $\partial\xi/\partial\mu$,

$$\xi_\mu(t,\mu) = \phi(t - \mu)[t(t - \mu) + 1] - \phi(t + \mu)[t(t + \mu) + 1].$$

Since $\phi(t - \mu)/\phi(t + \mu) = e^{2t\mu} \ge 1 + 2t\mu$, it follows for $0 \le \mu \le t$ that

$$\xi_\mu(t,\mu)/\phi(t + \mu) \ge (1 + 2t\mu)[t(t - \mu) + 1] - t(t + \mu) - 1 = 2t^2\mu(t - \mu) \ge 0.$$

Finally, differentiating again with respect to fi yields 

^M/^(i,/^) = g{f^;t) -g{-i2;t), 

where g{fj,; t) = [t{t — fif — fj](j){t — fi) and 

g^,{fi; t) = [{t - /i)((t - fift -2t-fi)- l]0(t -fi)>0 

if t > 3 and |/i| <t/3. For such /u, ^^^.{t,^) > 0, which establishes the convexity claimed in 
dHnU) . Finally, for < ^ < t/3, 

C^^it, //) < ff(t/3; t) < t^(l){2t/3) < CO. □ 
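As an illustration of Lemma 8.2, the closed form for the covariance kernel can be compared with a direct numerical evaluation of $E\,z[\eta_H(\mu + z, t) - \mu]$. This is only a sketch: the function names, the Simpson rule and the integration cut-off are assumptions of the example. It also gives a convenient way to spot-check the bounds in (8.9).

```python
import math
from statistics import NormalDist

STD = NormalDist()


def xi(t, mu):
    """Closed form (8.8) for the covariance kernel of hard thresholding."""
    return (t * (STD.pdf(t - mu) + STD.pdf(t + mu))
            + STD.cdf(-(t - mu)) + STD.cdf(-(t + mu)))


def simpson(f, a, b, steps=4000):
    """Composite Simpson rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def xi_numeric(t, mu, cutoff=12.0):
    """Numerical value of E z[eta_H(mu + z, t) - mu] for z ~ N(0, 1).

    Since E[z] = 0, the kernel equals the integral of z * (mu + z) * phi(z)
    over the region |mu + z| > t, which splits into two half-lines.
    """
    g = lambda z: z * (mu + z) * STD.pdf(z)
    far = t + abs(mu) + cutoff  # integration limit; the mass beyond is negligible
    return simpson(g, t - mu, far) + simpson(g, -far, -t - mu)


def satisfies_bounds(t, mu):
    """Check the two bounds in (8.9) at a single point (t >= 1 assumed)."""
    ok = xi(t, mu) <= t + 1.0
    if abs(mu) <= t - math.sqrt(2.0 * math.log(t)):
        ok = ok and xi(t, mu) <= 2.0
    return ok
```

Calling `xi(3.0, 0.0)` returns $2t\phi(t) + 2\tilde\Phi(t)$ at $t = 3$, the value of the kernel at a zero mean.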
8.2 Data-dependent thresholds 

Lemma 8.3. Let $x = \mu + z \sim N(\mu, 1)$ and $\eta(x,t)$ denote soft or hard thresholding at $t$. For $r > 0$,

$$|\eta(x,t) - \mu|^r \le 2^{(r-1)_+}(|z|^r + t^r). \tag{8.13}$$

Proof. Check cases and use $|a + b|^r \le 2^{(r-1)_+}(|a|^r + |b|^r)$. □
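The case check behind (8.13) can be mirrored by a small randomized test. The sketch below is illustrative only; the function names and the sampling ranges are assumptions of the example. It evaluates both sides of (8.13) for soft and hard thresholding at random draws of $(\mu, z, t, r)$.

```python
import math
import random


def hard(x, t):
    """Hard thresholding: keep x if |x| > t, otherwise return 0."""
    return x if abs(x) > t else 0.0


def soft(x, t):
    """Soft thresholding: shrink x towards 0 by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def bound_813_holds(mu, z, t, r, rule):
    """Check |eta(mu + z, t) - mu|^r <= 2^{(r-1)_+} (|z|^r + t^r) for one draw."""
    lhs = abs(rule(mu + z, t) - mu) ** r
    rhs = 2.0 ** max(r - 1.0, 0.0) * (abs(z) ** r + t ** r)
    return lhs <= rhs * (1.0 + 1e-12)  # small slack for floating point rounding


def check_813(trials=20000, seed=0):
    """Return True if (8.13) holds at every random draw, for both rules."""
    rng = random.Random(seed)
    for _ in range(trials):
        mu = rng.uniform(-10.0, 10.0)
        z = rng.gauss(0.0, 1.0)
        t = rng.uniform(0.0, 5.0)
        r = rng.uniform(0.05, 4.0)
        if not (bound_813_holds(mu, z, t, r, hard)
                and bound_813_holds(mu, z, t, r, soft)):
            return False
    return True
```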

Lemma 8.4. Suppose that $y \sim N_n(\mu, I)$ and that $\hat\mu(y)$ corresponds to soft or hard thresholding at random level $\hat t$: $\hat\mu(y)_i = \eta(y_i, \hat t)$. Suppose that $\hat t \le t$ almost surely on the event $A$
(with t > [S|zp^]^/2rj^ rj.^^^ ^ ^ 

E^[\\fi-f,f,A] < 2'^^i/VnP^(^)V2. (8.14) 
Remark: The notation $E[X, A]$ denotes $E X I_A$ where $I_A$ is the indicator function of the event $A$.

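Proof. Rewrite the left side and use Cauchy-Schwarz:

$$\sum_{i=1}^n E[|\eta(y_i,\hat t) - \mu_i|^r, A] \le P(A)^{1/2} \sum_{i=1}^n \left(E[|\eta(y_i,\hat t) - \mu_i|^{2r}, A]\right)^{1/2}.$$

Now (8.13) and the bound on $t$ imply (8.14). □

Continuing with $y \sim N_n(\mu, I)$, the next lemma matches the $\ell_r$ risks of two hard threshold estimators $\hat\mu(y)_i = \eta_H(y_i;\hat t)$ and $\hat\mu'$ with data-dependent thresholds $\hat t$ and $\hat t'$ if those thresholds are close. Assume also that there is a non-random bound $t$ such that $\hat t, \hat t' \le t$ with probability one. Then

$$|\eta_H(y_i,\hat t) - \eta_H(y_i,\hat t')| \le \begin{cases} t & \text{if } |y_i| \text{ lies between } \hat t, \hat t', \\ 0 & \text{otherwise.} \end{cases}$$

Let $N' = \#\{i : |y_i| \in [\hat t, \hat t']\}$: clearly $\|\hat\mu - \hat\mu'\|_r^r \le t^r N'$. In various cases, $N'$ can be bounded on a high probability event, yielding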

Lemma 8.5. Let $\beta_n$ be a specified sequence, and with the previous definitions, set $B_n = \{N' \le \beta_n\}$. For $0 < r \le 2$,

< 2P^t^ + rI{r > 

I/O (8.15) 
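Proof. To develop an $\ell_r$ analog of (9.21), we note a simple bound valid for all $a, z \in \mathbb{R}$:

$$\bigl|\,|a + z|^r - |a|^r\,\bigr| \le \begin{cases} |z|^r, & 0 < r \le 1, \\ r(|a| + |z|)^{r-1}|z|, & 1 < r. \end{cases} \tag{8.16}$$

[For $r > 1$, use derivative bounds for $y \to |y|^r$.] We consider here only $1 < r \le 2$: the case $r \le 1$ is similar and easier. Thus, setting $\epsilon = \hat\mu - \hat\mu'$, $\Delta = \hat\mu - \mu$ and similarly for $\Delta'$,

$$\Bigl|E\Bigl\{\sum_i |\Delta_i|^r - |\Delta_i'|^r, B_n\Bigr\}\Bigr| \le r E\Bigl\{\sum_i |\Delta_i'|^{r-1}|\epsilon_i| + |\epsilon_i|^r, B_n\Bigr\}.$$

Using Hölder's inequality and defining $\epsilon_n = E\{\|\hat\mu - \hat\mu'\|_r^r, B_n\}$, we obtain

$$\bigl|E\{\|\Delta\|_r^r - \|\Delta'\|_r^r, B_n\}\bigr| \le r\rho(\hat\mu',\mu)^{(r-1)/r}\epsilon_n^{1/r} + r\epsilon_n. \tag{8.17}$$

From the definition of event $B_n$ and the remarks preceding the lemma, $\epsilon_n \le \beta_n t^r$. On $B_n^c$, apply Lemma 8.4 to obtain (8.15). □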
9 Upper Bound Result: $\ell_2$ error

We now turn to the upper bound, Theorem 4.1. We begin with the simplest case: squared-error loss. Only the outline of the argument is presented in this section, with details provided in the next section. The extensions to $\ell_r$ error measures, of considerable importance to the conclusions of the paper, are not straightforward. The proofs are postponed until Section 11.

The approach taken in this section was sketched in the introduction, see and 
(1.13). We define certain empirical and theoretical complexity functions - the empirical complexity being minimized by $\hat\mu_2$. A basic inequality bounds the theoretical complexity of $\hat\mu_2$ by the minimal theoretical complexity plus an "error term" of covariance type. When maximized over a sparse parameter set $\Theta_n$, the minimum theoretical complexity has the same leading asymptotics as the minimax estimation risk for $\Theta_n$. To complete the proof, the error term is bounded. This analysis is sketched in Section 9.3; the detailed proof relies on an average case and large deviations analysis of the penalized FDR index $\hat k_2$. The upshot is that these terms are negligible if $q \le 1/2$, but add substantially to the maximum risk when $q > 1/2$; this was foreshadowed in Proposition 5.5 and its discussion. Finally, a risk comparison is used to extend the conclusion from the penalized estimate $\hat\mu_2$ to the original FDR estimate $\hat\mu_F$.



9.1 Empirical and Theoretical Complexities 

In Section 1.8 we have defined $\hat\mu_2$ as the minimizer of the empirical complexity $K(\tilde\mu, y) = \|y - \tilde\mu\|^2 + \mathrm{Pen}(\tilde\mu)$ (note that now we set $\sigma^2 = 1$). Substituting $y = \mu + z$ into $K(\hat\mu_2, y)$ yields the decomposition

$$K(\hat\mu_2, y) = K(\hat\mu_2, \mu) + 2\langle z, \mu - \hat\mu_2\rangle + \|z\|^2. \tag{9.1}$$

Now let $\mu_0 = \mu_0(\mu)$ denote the minimizer over $\tilde\mu$ of the theoretical complexity $K(\tilde\mu, \mu)$ corresponding to the unknown mean vector $\mu$:

$$K(\mu_0, \mu) = \inf_{\tilde\mu} \|\tilde\mu - \mu\|^2 + \mathrm{Pen}(\tilde\mu). \tag{9.2}$$

There is a decomposition for $K(\mu_0, y)$ that is exactly analogous to (9.1):

$$K(\mu_0, y) = K(\mu_0, \mu) + 2\langle z, \mu - \mu_0\rangle + \|z\|^2.$$

Since by definition of $\hat\mu_2$, $K(\mu_0, y) \ge K(\hat\mu_2, y)$, we obtain, after noting the cancellation of the quadratic error terms and rearranging,

$$K(\hat\mu_2, \mu) \le K(\mu_0, \mu) + 2\langle z, \hat\mu_2 - \mu_0\rangle. \tag{9.3}$$
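The basic inequality (9.3) is deterministic, so it can be checked on simulated data. The sketch below is a hedged illustration rather than the authors' implementation: it assumes the penalty $\mathrm{Pen}(\tilde\mu) = \sum_{l \le N(\tilde\mu)} t_l^2$ with $t_l = z(ql/2n)$, as in (9.9), and all function names are hypothetical. Both complexities are minimized by keeping the $k$ largest coordinates in absolute value for the best $k$.

```python
import random
from statistics import NormalDist

STD = NormalDist()


def levels(n, q):
    """Assumed threshold sequence t_l = z(q l / 2n), l = 1..n."""
    return [-STD.inv_cdf(q * l / (2.0 * n)) for l in range(1, n + 1)]


def complexity(fit, target, t):
    """K(fit, target) = ||target - fit||^2 + sum_{l <= N(fit)} t_l^2."""
    n_nonzero = sum(1 for v in fit if v != 0.0)
    rss = sum((a - b) ** 2 for a, b in zip(target, fit))
    return rss + sum(tl * tl for tl in t[:n_nonzero])


def minimizer(target, t):
    """Minimize K(., target) by keeping the k largest |target_i| for the best k."""
    order = sorted(range(len(target)), key=lambda i: -abs(target[i]))
    tail = sum(v * v for v in target)   # residual sum of squares for k = 0
    pen = 0.0
    best_k, best_val = 0, tail
    for k, i in enumerate(order, start=1):
        tail -= target[i] ** 2
        pen += t[k - 1] ** 2
        if tail + pen < best_val:
            best_k, best_val = k, tail + pen
    fit = [0.0] * len(target)
    for i in order[:best_k]:
        fit[i] = target[i]
    return fit


def inner(a, b):
    """Euclidean inner product."""
    return sum(x * y for x, y in zip(a, b))


def basic_inequality_gap(mu, z, q):
    """Right side minus left side of (9.3); should be non-negative."""
    t = levels(len(mu), q)
    y = [m + e for m, e in zip(mu, z)]
    mu_hat = minimizer(y, t)    # empirical minimizer
    mu_0 = minimizer(mu, t)     # theoretical minimizer
    diff = [a - b for a, b in zip(mu_hat, mu_0)]
    return complexity(mu_0, mu, t) + 2.0 * inner(z, diff) - complexity(mu_hat, mu, t)
```

A non-negative return value from `basic_inequality_gap` for every noise draw reflects that (9.3) holds pathwise, before any expectation is taken.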

Thus the complexity of $\hat\mu_2$ is bounded by the minimum theoretical complexity plus an error term. Up to this point, the development is close to that of Donoho and Johnstone (1994), as well as work of other authors (e.g. van de Geer (1990)). Since

$$K(\hat\mu_2, \mu) = \|\hat\mu_2 - \mu\|^2 + \mathrm{Pen}(\hat\mu_2), \tag{9.4}$$

we obtain a bound for $\rho(\hat\mu_2, \mu) = E_\mu\|\hat\mu_2 - \mu\|^2$ by taking expectations in (9.3). Since the error term has zero mean, we may replace $\mu_0$ by $\mu$ and obtain the basic bound

$$\rho(\hat\mu_2, \mu) \le K(\mu_0, \mu) + 2E_\mu\langle z, \hat\mu_2 - \mu\rangle - E_\mu \mathrm{Pen}(\hat\mu_2). \tag{9.5}$$
We view the right-hand side as containing a 'leading term' $K(\mu_0, \mu)$ - the theoretical complexity - and an 'error term'

$$\mathrm{Err}_2(\mu, \hat\mu_2) = 2E_\mu\langle z, \hat\mu_2 - \mu\rangle - E_\mu \mathrm{Pen}(\hat\mu_2). \tag{9.6}$$

We now claim that the maximum theoretical complexity over sparsity classes $\Theta_n$ is asymptotic to the minimax risk.

Proposition 9.1. Assume (Q), (H).

$$\sup_{\mu \in \Theta_n} K(\mu_0(\mu), \mu) \sim R_n(\Theta_n), \qquad n \to \infty. \tag{9.7}$$

The same minimax risk bounds the error term:

Proposition 9.2. Assume (Q), (H). With $u_{2p} = 1$ for $\ell_0$ and strong $\ell_p$, and $u_{2p} = 1 - p/2$ for weak $\ell_p$,

$$\sup_{\mu \in \Theta_n} \mathrm{Err}_2(\mu, \hat\mu_2) \le \begin{cases} o(R_n(\Theta_n)), & q \le 1/2, \\ u_{2p}\,\dfrac{2q - 1}{1 - q}\,R_n(\Theta_n)(1 + o(1)), & q > 1/2. \end{cases} \tag{9.8}$$

Together, these propositions give the Upper Bound result in the squared-error case.

9.2 Maximal Theoretical Complexity 

We prove Proposition 9.1, beginning with the nearly-black classes $\Theta_n = \ell_0[\eta_n]$.

As in Section 1.8, decompose the optimization problem (9.2) defining $K(\mu_0, \mu)$ over the number of non-zero components in $\tilde\mu$. Assign these to the largest components of $\mu$: hence

$$K(\mu_0, \mu) = \inf_k \sum_{i=k+1}^{n} \mu_{(i)}^2 + \sum_{l=1}^{k} t_l^2. \tag{9.9}$$

On $\ell_0[\eta_n]$, at most $k_n = [n\eta_n]$ components of $\mu$ can be non-zero. Hence the infimum over $k$ may be restricted to $0 \le k \le k_n$. This implies

$$\sup_{\mu \in \ell_0[\eta_n]} K(\mu_0, \mu) = \sum_{l=1}^{k_n} t_l^2. \tag{9.10}$$

Indeed, choosing $k = k_n$ in (9.9) shows the left side to be smaller than the right side in (9.10), while equality occurs for any $\mu$ with non-zero entries $\mu_1 = \ldots = \mu_{k_n} > t_1$. Noting

$$\sum_{1}^{k} t_l^2 \sim k t_k^2, \qquad t_k^2 \sim 2\log(n/k) \quad \text{if } k = o(n), \tag{9.11}$$

(cf. (8.7) and Lemma 12.3), along with $\eta_n = O(n^{-b_3})$, we get

$$\sum_{1}^{k_n} t_l^2 \sim k_n t_{k_n}^2 \sim n\eta_n \cdot 2\log\eta_n^{-1} \sim R_n(\ell_0[\eta_n]),$$

which establishes (9.7) in the $\ell_0$ case.
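The $\ell_0$ computation above can be reproduced numerically. The following sketch is illustrative only: the function names are hypothetical, and the thresholds are taken as $t_l = z(ql/2n)$, consistent with $t_1 = z(q/2n)$ used elsewhere in the text. It evaluates the infimum in (9.9) directly, compares it with the penalty sum in (9.10) for a vector with $k_n$ large equal entries, and examines the ratios behind (9.11).

```python
import math
from statistics import NormalDist

STD = NormalDist()


def t_level(l, n, q):
    """Assumed threshold t_l = z(q l / 2n): upper Gaussian quantile."""
    return -STD.inv_cdf(q * l / (2.0 * n))


def penalty_sum(k, n, q):
    """Sum of t_l^2 for l = 1..k, the right side of (9.10) when k = k_n."""
    return sum(t_level(l, n, q) ** 2 for l in range(1, k + 1))


def theoretical_complexity(mu, q):
    """Evaluate (9.9): inf over k of (sum of the n - k smallest mu_i^2) + sum_{l<=k} t_l^2."""
    n = len(mu)
    sq = sorted((m * m for m in mu), reverse=True)
    tail = sum(sq)          # value at k = 0
    pen = 0.0
    best = tail
    for k in range(1, n + 1):
        tail -= sq[k - 1]
        pen += t_level(k, n, q) ** 2
        best = min(best, tail + pen)
    return best


if __name__ == "__main__":
    n, q, k = 10**6, 0.25, 1000
    tk2 = t_level(k, n, q) ** 2
    print("sum t_l^2 / (k t_k^2) =", penalty_sum(k, n, q) / (k * tk2))
    print("t_k^2 / (2 log(n/k)) =", tk2 / (2.0 * math.log(n / k)))
```

Both printed ratios approach 1 only slowly, which is in keeping with the logarithmic rates in (8.7) and Lemma 12.3.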

Remark. Using a fixed penalty $\mathrm{Pen}_{fix}(\mu) = t^2\|\mu\|_0$ in the above argument would yield $\sup K = k_n t^2 \sim n\eta_n t^2$, but the $t^2$ term is unable to adapt to varying signal complexity.
Weak $\ell_p$. The maximum of (9.9) over $\mu \in m_p[\eta_n]$ occurs at the extremal vector $\bar\mu_l = C_n l^{-1/p}$, where $C_n = n^{1/p}\eta_n$. Define $\tilde k_n$ to be the solution of $C_n^2 k^{-2/p} = t_k^2$. Using (9.11), we obtain

n k 

sup if (mo, = inf Cl J2 l'^'^ + E 

= (l + rp)fc„4„. (9.12) 
Thus fc„ is the optimal number of non-zero components and may be rewritten as 

k^^Clt^^ni^^^tll (9.13) 

A little analysis using H9.11|l and the equation for shows that t| ~ 21ogr7~^. [For this reason, 
we define fc„ = nrjP^T^P ] From (|3.4|) we then conclude ^ l^n' which via H3.6|) and H3.7|l shows 
that the right side of (|9.12|l is asymptotically equivalent to i?(mp [?]„]), as claimed. 

Remark. The least favorable configuration for /i is thus given by /i; = min(C„Z^^/^, i/) — 
min(77„(//ri)"^/P, ti), which, after noting H9.11() . is essentially identical with the least favorable dis- 
tribution H3.8|l . In addition, the maximisation has exactly the same structure as the approximate 
evaluation of the Bayes risk of soft thresholding over this least favorable distribution; compare (|3.9|l 
- (|3.11ll . Replacing the slowly varying boundary I ti hy rrifc — i[fc„] +a leads to the configurations 



Strong $\ell_p$. The maximal theoretical complexity is the value of the optimization problem

$$(Q(n, \eta_n^p)) \qquad \max \sum_l \min(\mu_l^2, t_l^2) \quad \text{subject to} \quad \sum_l \mu_l^p \le n\eta_n^p.$$

The change of variables $x_l = \mu_l^p$ allows us to write this as

$$\max \sum_l \min(x_l^{2/p}, t_l^2) \quad \text{subject to} \quad \sum_l x_l \le n\eta_n^p, \quad x_1 \ge x_2 \ge \cdots.$$

Since $p < 2$, the objective function is strictly convex on $\prod_l [0, t_l^p]$, and the constraint set is convex. The maximum will be obtained at an extreme point of the constraint set, i.e. roughly at a sequence vanishing for $l > k$ (for some $k$) and equal to $t_l^p$ for $l \le k$. Let $\tilde k_n$ be the largest $k$ for which

$$\sum_{l=1}^{k} t_l^p \le n\eta_n^p.$$

Then the maximal theoretical complexity obeys

$$\sum_{1}^{\tilde k_n} t_l^2 \le \mathrm{val}(Q(n, \eta_n^p)) \le \sum_{1}^{\tilde k_n + 1} t_l^2.$$

Again using (9.11), we get

So (9.7) follows in the $\ell_p[\eta_n]$ case.
9.3 The Error term 

We outline the proof of (9.8). Recall the definitions $k_\pm$ and $t_\pm$ from Section 5, at equations (5.9), (5.8), (5.11), and their use in Proposition 5.1. We will rely on the fact that it is $\Theta_n$-likely under $P_\mu$ that $\hat k_2 \le k_+(\mu)$ and hence that $\hat t_2 \ge t_-(\mu)$. First write

$$\langle z, \hat\mu_2 - \mu\rangle = \sum_{i=1}^{n} z_i[\eta_H(y_i, \hat t_2) - \mu_i]. \tag{9.14}$$

We exploit monotonicity of the error term for small components $\mu_i$. Indeed, if $|\mu_i| \le t_-(\mu) \le \hat t_2$, then (cf. Lemma 10.1)

$$z_i[\eta_H(y_i, \hat t_2) - \mu_i] \le z_i[\eta_H(y_i, t_-(\mu)) - \mu_i], \tag{9.15}$$

as may be seen by checking cases. This permits us to replace $\hat t_2$ by the fixed threshold value $t_-(\mu)$ for the vast majority of components $\mu_i$, with Proposition 5.1 providing assurance that $\hat t_2 \ge t_-(\mu)$ with high probability. We recall the definition of the covariance kernel $\xi$ in Lemma 8.2. For $z_1 \sim N(0,1)$ and scalar mean $\mu_1$,

$$\xi(t, \mu_1) = E\, z_1[\eta_H(z_1 + \mu_1, t) - \mu_1].$$

The function $\xi$ is the covariance between $y_1$ and $\eta_H(y_1, t)$ when the data $y_1 \sim N(\mu_1, 1)$. Lemma 8.2 shows that $\xi$ is even in $\mu_1$, has a minimum of approximately $2t\phi(t)$ at $\mu_1 = 0$, rising to a maximum near $\mu_1 = t$ (though always bounded by $t + 1$), and dropping quickly to 1 for large $\mu_1$. It turns out that, uniformly on nearly-black sequences $\mu \in \Theta_n$, the main contribution to the sum comes from components $\mu_i$ near 0.

Similarly, it is 0„-likely that 

1 

We proceed heuristically here, leaving the (necessary!) careful verification to Section 10. Interpreting "$\approx$" to mean "up to terms whose positive part is $o(R_n(\Theta_n))$", we have, uniformly on $\Theta_n$,

$$\mathrm{Err}_2(\mu, \hat\mu_2) \approx 2\sum_i \xi(t_-(\mu), \mu_i) - k_-(\mu)t_-^2(\mu) \tag{9.16}$$
$$\approx 4n\,t_-(\mu)\phi(t_-(\mu)) - k_-(\mu)t_-^2(\mu) \tag{9.17}$$
$$\approx [4n\tilde\Phi(t_-(\mu)) - k_-(\mu)]\,t_-^2(\mu) \tag{9.18}$$
$$\approx (2q_n - 1)k_-(\mu)t_-^2(\mu). \tag{9.19}$$

At (9.17) we have first used the fact that $\xi(t, 0) \sim 2t\phi(t)$. Second, for the at most $k_n$ non-zero terms, we use the bound $\xi(t, \mu) \le t + 1 \le t_1 + 1$ - compare (8.9) - and note that their contribution is at most $O(k_n t_1) = o(R_n)$. At (9.18) we used $\phi(t) \sim t\tilde\Phi(t)$ as $t \to \infty$, and at (9.19) the definitions of $t_\pm(\mu)$, and the asymptotics of each.

Expression (9.19) is negative if $q_n < 1/2$, making $\mathrm{Err}_2$ for our purposes negligible. For $q_n > 1/2$, since $k \to k t_k^2$ is increasing, cf. (12.11), we use the bound of (5.6), namely $k_-(\mu) \le \bar k = k_n/(1 - q_n)$ on $\ell_0[\eta_n]$, along with (9.11) to conclude that (9.19) is not larger than

$$(2q_n - 1)\,\bar k\,t_{\bar k}^2 \sim \frac{2q_n - 1}{1 - q_n}\,R_n(\ell_0[\eta_n]). \tag{9.20}$$

This motivates (9.8) in the $\ell_0$ case.

Weak $\ell_p$. The outline is much as above, although there is detailed technical work since all means may be non-zero (subject to the weak-$\ell_p$ sparsity constraint); for example in the transition (9.16) to (9.17). Again, after (9.19) we are led to maximize over $m_p[\eta_n]$ and from (5.7), find $k_-(\mu) \le k(\mu) \le k_n/(1 - q_n) = \bar k$, say. Here $k_n = n\eta_n^p\tau_\eta^{-p}$ is the effective non-zero index for weak $\ell_p$ defined after (9.13).

Consequently, since $t_{\bar k} \sim \tau_\eta$, and using the expressions (3.6) and (3.7) for minimax risks, we obtain

$$(2q_n - 1)k_-(\mu)t_-^2(\mu) \le (2q_n - 1)\,\bar k\,t_{\bar k}^2\,(1 + o(1)) \sim \frac{2q_n - 1}{1 - q_n}\,n\eta_n^p\tau_\eta^{2-p} \sim \frac{2q - 1}{1 - q}\,R_n(\ell_p[\eta_n]) \sim u_{2p}\cdot\frac{2q - 1}{1 - q}\cdot R(m_p[\eta_n]),$$

with $u_{2p} = (1 - p/2)$.

Strong $\ell_p$. The inclusion $\ell_p[\eta_n] \subset m_p[\eta_n]$ and the previous display give

$$(2q_n - 1)k_-(\mu)t_-^2(\mu) \le \frac{2q_n - 1}{1 - q_n}\,R_n(\ell_p[\eta_n])(1 + o(1)).$$

If the above arguments were complete - rather than just sketches - we would now have the right side of (9.8) in Proposition 9.2. Details to come in Section 10.

9.4 From penalized to original FDR

We extend the adaptive minimaxity result for the penalized estimator $\hat\mu_2$ which thresholds at $\hat t_2$ to any threshold $\hat t$ in the range $[\hat t_F, \hat t_G]$ defined in Section 1.8. In particular the adaptive minimaxity will apply to the original FDR estimator $\hat\mu_F$.

First compare the squared error of a deviation $\hat\delta_2 = \hat\mu_2 - \mu$ with that of $\hat\delta = \hat\mu - \mu$:

$$\|\hat\delta\|^2 - \|\hat\delta_2\|^2 = \|\hat\mu - \hat\mu_2\|^2 + 2\langle\hat\mu - \hat\mu_2, \hat\mu_2 - \mu\rangle. \tag{9.21}$$

Now suppose $\hat\mu_D$ ("D" for "data dependent") has the form (4.1). All such estimators differ from $\hat\mu_2$ at most in those co-ordinates $y_l$ with $\hat k_G < l \le \hat k_F$, and on such co-ordinates the difference between the two estimates is at most $\hat t_G \le t_1 = z(q/2n)$. Hence

$$\|\hat\mu_D - \hat\mu_2\|^2 \le t_1^2(\hat k_F - \hat k_G).$$

Proposition 5.1 and (5.10) show that FDR control, combined with sparsity, forces the "crossover interval" $[\hat k_G, \hat k_F]$ to be relatively small, having length bounded by $3\alpha_n k_n$. On the event described in Proposition 5.1 we have

$$\|\hat\mu_D - \hat\mu_2\|^2 \le 3\alpha_n k_n t_1^2 = o(R_n(\Theta_n)). \tag{9.22}$$
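The crossover interval can be made concrete with a short simulation. The sketch below rests on assumptions that should be checked against the earlier sections of the paper: it takes $t_l = z(ql/2n)$, defines $\hat k_F$ as the largest $l$ with $|y|_{(l)} \ge t_l$ (step-up), $\hat k_G$ as the largest $l$ such that $|y|_{(j)} \ge t_j$ for all $j \le l$ (step-down), and $\hat k_2$ as the minimizer of the penalized criterion. The function names are hypothetical.

```python
import random
from statistics import NormalDist

STD = NormalDist()


def fdr_indices(y, q):
    """Return (k_G, k_2, k_F) for data y and FDR parameter q.

    Assumed definitions (see the lead-in): t_l = z(q l / 2n);
    k_F is the step-up index, k_G the step-down index, and k_2 minimizes
    sum_{l>k} |y|_(l)^2 + sum_{l<=k} t_l^2 (largest minimizer on ties).
    """
    n = len(y)
    mags = sorted((abs(v) for v in y), reverse=True)
    t = [-STD.inv_cdf(q * l / (2.0 * n)) for l in range(1, n + 1)]
    k_F = max((l for l in range(1, n + 1) if mags[l - 1] >= t[l - 1]), default=0)
    k_G = 0
    while k_G < n and mags[k_G] >= t[k_G]:
        k_G += 1
    # The criterion changes by t_k^2 - |y|_(k)^2 when k - 1 is replaced by k.
    value, best_val, k_2 = 0.0, 0.0, 0
    for k in range(1, n + 1):
        value += t[k - 1] ** 2 - mags[k - 1] ** 2
        if value <= best_val:
            best_val, k_2 = value, k
    return k_G, k_2, k_F


def hard_threshold_top(y, k):
    """Keep the k largest |y_i| and set the remaining coordinates to zero."""
    keep = set(sorted(range(len(y)), key=lambda i: -abs(y[i]))[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(y)]
```

With these definitions the ordering $\hat k_G \le \hat k_2 \le \hat k_F$ holds, and the squared distance between the fits at $\hat k_F$ and $\hat k_2$ is at most $t_1^2(\hat k_F - \hat k_G)$, which is the deterministic step used before (9.22).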

We summarize, with remaining details deferred to Section 11.4.

Theorem 9.3. If $\hat\mu_D$ satisfies (4.1), then for each $r \in (0, 2]$

$$\sup_{\mu \in \Theta_n} |\rho(\hat\mu_D, \mu) - \rho(\hat\mu_r, \mu)| = o(R_n(\Theta_n)),$$

so that asymptotic minimaxity of $\hat\mu_r$ implies the same property for any such $\hat\mu_D$.
10 Error term: Quadratic Loss

We now formalize the error term analysis of Section 9.3, collecting and applying the tools built up in earlier sections.

Lemma 10.1. If $|\mu| \le t \le t'$, then

$$(x - \mu)[\eta_H(x, t') - \mu] \le (x - \mu)[\eta_H(x, t) - \mu].$$

Proof. The difference RHS - LHS equals

$$(x - \mu)[\eta_H(x, t) - \eta_H(x, t')] = (x - \mu)\,x\,I\{t < |x| \le t'\} \ge 0,$$

since $\mathrm{sgn}\,x = \mathrm{sgn}(x - \mu)$ if $|x| > t \ge |\mu|$. □
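The monotonicity in Lemma 10.1 says that, when $|\mu|$ lies below both thresholds, lowering the threshold can only increase $(x - \mu)[\eta_H(x, \cdot) - \mu]$. A minimal randomized check follows; the function names and sampling ranges are assumptions of this example.

```python
import random


def hard(x, t):
    """Hard thresholding at level t."""
    return x if abs(x) > t else 0.0


def lemma_10_1_gap(x, mu, t_small, t_large):
    """RHS - LHS of Lemma 10.1: (x - mu) * [eta_H(x, t_small) - eta_H(x, t_large)]."""
    return (x - mu) * (hard(x, t_small) - hard(x, t_large))


def min_gap(trials=100000, seed=0):
    """Smallest gap over random draws with |mu| <= t_small <= t_large."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        t_small = rng.uniform(0.0, 5.0)
        t_large = t_small + rng.uniform(0.0, 5.0)
        mu = rng.uniform(-t_small, t_small)
        x = mu + rng.gauss(0.0, 1.0)
        worst = min(worst, lemma_10_1_gap(x, mu, t_small, t_large))
    return worst
```

The condition on $\mu$ matters: with $\mu = 3$, thresholds 1 and 2.5, and $x = 2$, the gap is $(2 - 3)(2 - 0) = -2$, so the inequality fails once $|\mu|$ exceeds the smaller threshold.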
We proceed with the formal analysis of the error term (9.6). Set

$$e_i = e_i(\hat t_2) = 2(y_i - \mu_i)[\eta_H(y_i, \hat t_2) - \mu_i],$$

and

$$A_n = A_n(\mu) = \{t_- \le \hat t_2 \le t_+\}, \qquad S_n(\mu) = \{i : |\mu_i| \le t_-\}.$$

We have

$$2E\langle z, \hat\mu_2 - \mu\rangle = E\sum_i e_i = E\Bigl[\sum_{S_n(\mu)} e_i, A_n\Bigr] + E\Bigl[\sum_{S_n^c(\mu)} e_i, A_n\Bigr] + E\Bigl[\sum_i e_i, A_n^c\Bigr] \tag{10.1}$$
$$= D_{an} + T_{2n} + T_{3n}, \tag{10.2}$$

where we use $D_{an}, D_{bn}$ etc. to denote 'dominant' terms, and $T_{jn}$ to denote terms that will be shown to be negligible.

Let $\bar e_i = e_i(t_-)$: the monotonicity of errors for small components (shown in Lemma 10.1) says that the first term on the right side is bounded above by

$$E\Bigl[\sum_{S_n(\mu)} \bar e_i, A_n\Bigr] = E\Bigl[\sum_{S_n(\mu)} \bar e_i\Bigr] - E\Bigl[\sum_{S_n(\mu)} \bar e_i, A_n^c\Bigr] = D_{bn} + T_{4n}.$$

Recalling the definition of $\xi(t, \mu)$ from Section 8.1, we have $E\bar e_i = 2\xi(t_-, \mu_i)$ and

$$D_{bn} = 2|S_n(\mu)|\,\xi(t_-, 0) + 2\sum_{S_n(\mu)} [\xi(t_-, \mu_i) - \xi(t_-, 0)] \le 2n\xi(t_-, 0) + T_{1n}(\mu),$$

say. To summarize, we obtain the following decomposition for the error term (9.6):

$$\mathrm{Err}_2(\mu, \hat\mu_2) \le D_{cn}(\mu) + \sum_{j=1}^{4} T_{jn}(\mu),$$

where

$$D_{cn}(\mu) = 2n\xi(t_-, 0) - E_\mu \mathrm{Pen}(\hat\mu_2).$$

Recall that $R_n(\Theta_n) \asymp k_n\tau_\eta^2$ for both $\ell_0[\eta_n]$ and $m_p[\eta_n]$. In the following, we will show negligibility of error terms by establishing that they are $O(k_n\tau_\eta)$ (or, in one case, $o(k_n\tau_\eta^2)$) uniformly over $\ell_0[\eta_n]$ or $m_p[\eta_n]$ respectively.
Dominant term. Using (12.1),

$$\xi(t, 0) = 2t\phi(t) + 2\tilde\Phi(t) \le 2t^2\tilde\Phi(t) + 6\tilde\Phi(t).$$

Since $2\tilde\Phi(t_-) = q_n k_+ n^{-1}$, we obtain

$$2n\xi(t_-, 0) \le 2q_n k_+ t_-^2 + 6q_n k_+ \le 2q_n k_- t_-^2 + c k_n\tau_\eta, \tag{10.3}$$

after observing that $k_+ - k_- \le 3\alpha_n k_n \le c k_n\tau_\eta^{-1}$ by (A), and that $t_- \le c\tau_\eta$ from (12.15).

For the penalty term in $D_{cn}$, we note that on $A_n$, $\hat k_2 \ge k_-$, and so $\mathrm{Pen}(\hat\mu_2) = \sum_{1}^{\hat k_2} t_l^2 \ge k_- t_+^2 \ge k_- t_-^2$. On the other hand, $A_n$ is $\Theta_n$-likely. As a result

$$E_\mu \mathrm{Pen}(\hat\mu_2) \ge k_- t_-^2 + O(k_n\tau_\eta). \tag{10.4}$$

Combining (10.3) and (10.4), we obtain

$$D_{cn}(\mu) \le (2q_n - 1)k_- t_-^2 + O(k_n\tau_\eta).$$

If $q_n \le 1/2$, then of course the leading term is non-positive, while in the case $1/2 < q_n < 1$, we note from the monotonicity of $k \to k t_k^2$ (cf. (12.11)) and the definition (6.37) of $\kappa_n$ that

$$k_- t_-^2 \le k_+ t^2[k_+] \le \kappa_n t^2[\kappa_n] \sim (1 - q_n)^{-1} k_n\tau_\eta^2,$$

which leads to the second term in the upper bound of (4.2).



Negligibility of $T_{1n} - T_{4n}$. Consider first the term $T_{1n}(\mu)$. For the non-zero $\mu_i$ (of which there are at most $n\eta_n$), use the bound (8.9) to get

$$T_{1n}(\mu) \le k_n(t_1 + 1) \le c k_n\tau_\eta.$$

For the large signal component term $T_{2n}$, we have, using Lemma 8.4 and the bound $\hat t_2 \le t_1 = O(\log^{1/2} n)$,

$$T_{2n} \le E\Bigl[\sum_{S_n^c} e_i, A_n\Bigr] \le 2\sum_{S_n^c} \{E[\eta(y_i, \hat t_2) - \mu_i]^2\}^{1/2} \le c_0 t_1|S_n^c(\mu)| \le |S_n^c(\mu)|(c_0\log n)^{1/2}. \tag{10.5}$$

On $\ell_0[\eta_n]$, clearly $|S_n^c(\mu)| \le n\eta_n$ and so $T_{2n}(\mu) \le c_1 k_n\tau_\eta$.

For the small threshold term $T_{3n}$, note first that $|\langle z, \hat\mu_2 - \mu\rangle| \le \|z\|\,\|\hat\mu_2 - \mu\|$, so that

$$T_{3n}(\mu) \le 2(E\|z\|^2)^{1/2}\{E[\|\hat\mu_2 - \mu\|^2, A_n^c]\}^{1/2}.$$

Now $E\|z\|^2 = n$, and since $A_n^c$ is a rare event, apply Lemma 8.4, noting the bound $\hat t_2 \le t_1$. Thus

$$T_{3n}(\mu) \le 8n t_1 P_\mu(A_n^c)^{1/4} \le c_1 n t_1\exp\{-c_2\log^2 n\} = o(R_n(\ell_0[\eta_n])),$$

uniformly on $\ell_0[\eta_n]$ after applying Proposition 5.1.

The remaining term $T_{4n}$ is handled exactly as was $T_{3n}$: if we let $\hat\mu$ denote hard thresholding at the (fixed) threshold $t_-$, then $|\sum_{S_n(\mu)} \bar e_i| \le 2\|z\|\,\|\hat\mu - \mu\|$; and now Lemma 8.4 and Proposition 5.1 can be used as before.

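Weak $\ell_p$. A little extra work is required to analyze $T_{1n}(\mu)$, so we first dispose of $T_{2n} - T_{4n}$. The analysis of $T_{3n}$ and $T_{4n}$ is essentially as above. For $T_{2n}$, we bound $|S_n^c(\mu)|$ using the extremal element of $m_p[\eta_n]$, namely $\bar\mu_l = \eta_n(n/l)^{1/p}$. Thus, for all $\mu \in m_p[\eta_n]$,

$$S_n^c(\mu) \subset \{l : \bar\mu_l > t_-(\mu)\},$$

and since, for $\eta$ sufficiently small,

$$t_-(\mu) = t[k_+(\mu)] \ge t[k_+(\bar\mu)] \ge \tau_\eta - 3/2 = \tau_\eta(1 + o(1)), \tag{10.6}$$

we have

$$|S_n^c(\mu)| \le n\eta_n^p t_-^{-p}(\mu) \le n\eta_n^p\tau_\eta^{-p}(1 + o(1)). \tag{10.7}$$

From (3.6) we have $R_n = R(\ell_p[\eta_n]) \sim n\eta_n^p\tau_\eta^{2-p}$, and so from (10.5) and (10.7), we get

$$T_{2n}(\mu) \le c_0 t_1 n\eta_n^p\tau_\eta^{-p} \le c k_n\tau_\eta.$$

For the $T_{1n}$ term, we obtain from Lemma 8.2 that

$$\xi(t, \mu) - \xi(t, 0) \le c(\mu^2 \wedge t),$$

and so $T_{1n}(\mu) \le 2c\sum_l \bar\mu_l^2 \wedge t_1$. The negligibility of $T_{1n}$ is a consequence of the following.

Lemma 10.2. For $0 < p < r \le 2$, we have

$$\sum_l \bar\mu_l^r \wedge t_1^{(r-1)_+} = o(k_n\tau_\eta^r).$$

Proof. Define $k$ by $\bar\mu_k = \tau_\eta/\log\tau_\eta$ so that $k = n\eta_n^p\tau_\eta^{-p}\log^p\tau_\eta$. We have

$$\sum_{l=1}^{n} \bar\mu_l^r \wedge t_1^{(r-1)_+} \le k t_1^{(r-1)_+} + \sum_{l > k} \bar\mu_l^r.$$

Since $t_1 \le c\tau_\eta$,

$$k t_1^{(r-1)_+} \le c k_n\tau_\eta^{(r-1)_+}\log^p\tau_\eta = o(k_n\tau_\eta^r).$$

By integral approximation, since $0 < p < r$,

$$\sum_{l > k} \bar\mu_l^r \le c_{rp}\,n\eta_n^p\tau_\eta^{r-p}\log^{p-r}\tau_\eta = o(k_n\tau_\eta^r). \qquad □$$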

11 $\ell_r$ losses

This section retraces for $\ell_r$ loss the steps used for squared error in Section 9, making adjustments for the fact that the quadratic decomposition (9.1) is no longer available. It turns out that this decomposition is merely a convenience - the asymptotic result of Theorem 4.1 is as sharp for all $0 < r \le 2$. However the analysis of the error term is more complex than in Section 10, requiring bounds developed in Lemmas 11.1 and 11.2.
11.1 Empirical Complexity for $\ell_r$ loss

For an $\ell_r$ loss function, we use a modified empirical complexity

$$K(\tilde\mu, y; r) = \|y - \tilde\mu\|_r^r + \sum_{l=1}^{N(\tilde\mu)} t_l^r.$$

The minimizers of empirical and theoretical complexity are defined, respectively, by

$$\hat\mu_r = \arg\min_{\tilde\mu} K(\tilde\mu, y; r), \qquad \mu_0 = \arg\min_{\tilde\mu} K(\tilde\mu, \mu; r).$$

For $\ell_r$ loss, the quadratic decomposition of (9.1) is replaced by

$$K(\tilde\mu, \mu + z) = K(\tilde\mu, \mu) + \|\mu - \tilde\mu + z\|_r^r - \|\mu - \tilde\mu\|_r^r. \tag{11.1}$$

The key inequality

$$K(\hat\mu_r, y) \le K(\mu_0, y),$$

when combined with (11.1) applied to both $\tilde\mu = \hat\mu_r$ and $\mu_0$, yields the analog of (9.3):

$$K(\hat\mu_r, \mu) \le K(\mu_0, \mu) + D(\hat\mu_r, \mu_0, \mu, y).$$

Setting

$$\hat\delta = \mu - \hat\mu_r, \qquad \delta_0 = \mu - \mu_0, \qquad y = \mu + z, \tag{11.2}$$

we have for the error term

$$D = D(\hat\mu_r, \mu_0, \mu, y) = \|\delta_0 + z\|_r^r - \|\delta_0\|_r^r - \|\hat\delta + z\|_r^r + \|\hat\delta\|_r^r = \sum_{i=1}^{n} d_i. \tag{11.3}$$

Thus, with $\mathrm{Err}_r = E_\mu D - E_\mu \mathrm{Pen}_r(\hat\mu_r)$,

$$E\|\hat\mu_r - \mu\|_r^r \le K(\mu_0, \mu; r) + \mathrm{Err}_r(\mu, \hat\mu_r). \tag{11.4}$$

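11.2 Maximum theoretical complexity

The theoretical complexity corresponding to $\ell_r$ loss is given by

$$K(\mu_0, \mu; r) = \inf_k \sum_{l=k+1}^{n} |\mu|_{(l)}^r + \sum_{l=1}^{k} t_l^r. \tag{11.5}$$

We may argue as in Section 9.2 that for $\Theta_n = \ell_0[\eta_n]$,

$$\sup_{\mu \in \Theta_n} K(\mu_0, \mu) \le \sum_{1}^{k_n} t_l^r \sim k_n t_{k_n}^r \sim n\eta_n(2\log\eta_n^{-1})^{r/2} \sim R_n(\Theta_n; r), \tag{11.6}$$

and that for $\Theta_n = m_p[\eta_n]$,

$$\sup_{\mu \in \Theta_n} K(\mu_0, \mu) = \inf_k\, C_n^r\sum_{l > k} l^{-r/p} + \sum_{l \le k} t_l^r \sim R_n(\Theta_n; r).$$

Finally, for $\Theta_n = \ell_p[\eta_n]$, we may argue that

$$\sup_{\mu \in \Theta_n} K(\mu_0, \mu) \sim \max\Bigl\{\sum_{1}^{k} t_l^r : \sum_{1}^{k} t_l^p \le n\eta_n^p\Bigr\} \sim R_n(\Theta_n; r).$$

We remark that if $\tilde k(\mu)$ is an index minimizing (11.5), then $\mu_{0i}$ is obtained from hard thresholding of $\mu_i$ at $t_0 = t[\tilde k(\mu)]$ (interpreted as $t_1$ if $\tilde k(\mu) = 0$). In any event, this implies

$$|\delta_{0i}| = |\mu_i - \mu_{0i}| \le t_1. \tag{11.7}$$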

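11.3 The $\ell_r$ error term

There is an $\ell_r$ analog of bound (9.15); this allows us to replace the random threshold $\hat t_r$ by the fixed threshold value $t_-$ for the most important cases.

Indeed, suppose that $|\mu_i| \le t_- \le \hat t_r$. Let $\bar\mu_i(y) = \eta_H(y_i, t_-)$ denote hard thresholding at $t_-$, and let $\bar\delta_i = \mu_i - \bar\mu_i$ denote the corresponding deviation. We claim that

$$|\hat\delta_i|^r - |\hat\delta_i + z_i|^r \le |\bar\delta_i|^r - |\bar\delta_i + z_i|^r. \tag{11.8}$$

Indeed, $\hat\delta_i = \bar\delta_i$ unless $t_- < |y_i| \le \hat t_r$. In this case, we have $\hat\mu_i = 0$ so that $\hat\delta_i = \mu_i$ while $\bar\mu_i = y_i$ so that $\bar\delta_i = -z_i$. In this case, (11.8) reduces to

$$|\mu_i|^r - |y_i|^r \le |z_i|^r,$$

which is trivially true since $|\mu_i| \le t_- \le |y_i|$.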

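We now derive the $\ell_r$ analog of the error decomposition (10.1). Recalling the notation (11.2) - (11.3), we have

$$d_i = d_i(\hat t_r) = |\delta_{0i} + z_i|^r - |\delta_{0i}|^r - |\hat\delta_i + z_i|^r + |\hat\delta_i|^r. \tag{11.9}$$

Defining as in Section 10 the sets $A_n = \{t_- \le \hat t_r \le t_+\}$ and $S_n(\mu) = \{i : |\mu_i| \le t_-\}$, we obtain

$$E_\mu D = E\sum_i d_i = E\Bigl[\sum_{S_n} d_i, A_n\Bigr] + E\Bigl[\sum_{S_n^c} d_i, A_n\Bigr] + E\Bigl[\sum_i d_i, A_n^c\Bigr] = D_{an} + T_{2n} + T_{3n}.$$

Let $\bar d_i = d_i(t_-)$: the monotonicity of errors for small components (compare (11.8)) says that the leading term

$$D_{an} \le E\Bigl[\sum_{S_n} \bar d_i, A_n\Bigr] = E\Bigl[\sum_{S_n} \bar d_i\Bigr] - E\Bigl[\sum_{S_n} \bar d_i, A_n^c\Bigr] = D_{bn} + T_{4n}.$$

Consider first the dominant term $D_{bn}$. First, write

$$E\bar d_i = E_\mu\{|\delta_{0i} + z_i|^r - |\delta_{0i}|^r - |z_i|^r\} + E_\mu\{-|\bar\delta_i + z_i|^r + |\bar\delta_i|^r + |z_i|^r\} = \psi_r(\delta_{0i}) + \xi_r(t_-, \mu_i), \tag{11.10}$$

where, for $y = \mu + z$, $z \sim N(0, 1)$, and $0 < r \le 2$, we define

$$\psi_r(a) = E[\,|a + z|^r - |a|^r - |z|^r\,], \tag{11.11}$$
$$\xi_r(t, \mu) = E[\,|\eta_H(y, t) - \mu|^r - |\eta_H(y, t) - y|^r + |y - \mu|^r\,]. \tag{11.12}$$

[Note that a term $E|z|^r$ has been introduced in both $\psi_r$ and $\xi_r$: as a result $\psi_2(a) = 0$ and $\xi_2(t, \mu) = 2\xi(t, \mu)$ as defined at (8.8).] The next lemmas, proved in Section 11.5 below, play the same role as Lemma 8.2 for the $\ell_2$ case.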
Lemma 11.1. The function $\psi_r(a)$ defined at (11.11) is even in $a$ and

$$|\psi_r(a)| \le \begin{cases} C_1|a|^r & \text{for all } a, \\ C_2|a|^{(r-1)_+} & \text{for } |a| \text{ large.} \end{cases} \tag{11.13}$$

Lemma 11.2. The function $\xi_r(t, \mu)$ defined at (11.12) is even in $\mu$ and satisfies

$$\xi_r(t, 0) = 2\int_{|z| > t} |z|^r\phi(z)\,dz, \tag{11.14}$$
$$|\xi_r(t, \mu)| \le C[t^{(r-1)_+} + 1], \qquad \mu \in \mathbb{R},\ t > 0. \tag{11.15}$$

With the preceding notation, we may therefore write

$$D_{bn}(\mu) = \sum_{S_n(\mu)} \psi_r(\delta_{0i}) + \xi_r(t_-, \mu_i) = |S_n(\mu)|\,\xi_r(t_-, 0) + T_{1n}(\mu),$$

say. To summarize, we obtain the following decomposition for the error term in (11.4):

$$E_\mu D - E_\mu \mathrm{Pen}_r(\hat\mu_r) \le D_{cn}(\mu) + \sum_{j=1}^{4} T_{jn}(\mu),$$

$$D_{cn}(\mu) = n\xi_r(t_-, 0) - E_\mu \mathrm{Pen}_r(\hat\mu_r).$$

Dominant term. Using (8.6),

$$\xi_r(t, 0) \le 4t^r\tilde\Phi(t)[1 + 2t^{-2}].$$

Since $2\tilde\Phi(t_-) = q_n k_+ n^{-1}$, we obtain

$$n\xi_r(t_-, 0) \le 2q_n k_+ t_-^r + 8q_n k_+ t_-^{r-2} \le 2q_n k_- t_-^r + c k_n\tau_\eta^{r-1},$$

since $k_+ - k_- \le c k_n\tau_\eta^{-1}$ and $t_-(\mu) \sim \tau_\eta$.

For the penalty term, arguing as before yields

$$E_\mu \mathrm{Pen}_r(\hat\mu_r) \ge k_- t_-^r + O(k_n\tau_\eta^{r-1}).$$

Combining the two previous displays, we obtain

$$D_{cn}(\mu) \le (2q_n - 1)k_- t_-^r + O(k_n\tau_\eta^{r-1}).$$

If $q_n \le 1/2$, of course the leading term is non-positive, while in the case $1/2 < q_n < 1$, we note from (12.11) and the definition (6.37) of $\kappa_n$ that

$$k_-(\mu)t_-^r(\mu) \le k_+ t^r[k_+] \le \kappa_n t^r[\kappa_n] \sim (1 - q_n)^{-1} k_n\tau_\eta^r,$$

which shows that $D_{cn}(\mu)$ is bounded by the second term in the upper bound of (4.2).

Negligibility of $T_{1n} - T_{4n}$. Since there are at most $k_n$ non-zero terms in $T_{1n}$, from Lemmas 11.1 and 11.2 we obtain

$$T_{1n}(\mu) \le C k_n t_1^{(r-1)_+} = o(k_n\tau_\eta^r).$$

To bound $T_{2n}$, we first note from the properties of hard thresholding (compare (8.13)) that

$$|\hat\delta_i| = |\hat\mu_{r,i} - \mu_i| \le \hat t_r + |z_i| \le t_1 + |z_i|.$$

Inequality (11.7) shows that $|\delta_{0i}| \le t_1$.

Combined with (11.9) and (8.16), this shows, for $1 < r \le 2$,

$$|d_i| \le 3t_1^{r-1}|z_i| + 6|z_i|^r, \tag{11.16}$$

while for $0 < r \le 1$ only the $|z_i|^r$ term is needed. Consequently, there exist constants $C_i$ such that for $0 < r \le 2$

$$E|d_i| \le C_1 t_1^{(r-1)_+}, \qquad \text{and} \qquad E d_i^2 \le C_2 t_1^{2(r-1)_+}. \tag{11.17}$$

Thus

$$|T_{2n}| \le \sum_{S_n^c} E|d_i| \le C|S_n^c(\mu)|\,t_1^{(r-1)_+}. \tag{11.18}$$

And so on $\ell_0[\eta_n]$,

$$|T_{2n}| \le C n\eta_n t_1^{(r-1)_+} = O(k_n\tau_\eta^{(r-1)_+}).$$

To bound $T_{3n}(\mu)$, use (11.17) and Cauchy-Schwarz:

$$T_{3n}(\mu) \le P(A_n^c)^{1/2}\sum_i (E d_i^2)^{1/2} \le c\,n\,t_1^{(r-1)_+}\exp\{-c_0(\log n)^2\} = o(R_n(\Theta_n)),$$

since $A_n$ is $\Theta_n$-likely. Argue similarly for $T_{4n}$, with threshold at $t_-(\mu)$ instead of $\hat t_r$.

Weak $\ell_p$. For the $T_{1n}$ term, we use a consequence of Lemmas 11.1 and 11.2, proved in Section 11.5, to bound the summands in $T_{1n}(\mu)$:

$$|\psi_r(\delta_{0i}) + \xi_r(t_-, \mu_i) - \xi_r(t_-, 0)| \le C[\,|\mu_i|^r \wedge t_1^{(r-1)_+}]. \tag{11.19}$$

Combined with Lemma 10.2 this shows that $\sup_{m_p[\eta_n]} T_{1n}(\mu) = o(k_n\tau_\eta^r)$.
For $T_{2n}$, we use (10.7) in (11.18) to obtain

$$|T_{2n}| \le C n\eta_n^p\tau_\eta^{-p} t_1^{(r-1)_+} = o(R_n(\Theta_n)).$$

The analysis of $T_{3n}$ and $T_{4n}$ is as for the $\ell_0$ case.

11.4 From penalized to original FDR

Proof of Theorem 9.3. Apply Lemma 8.5 with $\hat\mu = \hat\mu_D$ and $\hat\mu' = \hat\mu_r$ and $t = t_1$. We abbreviate the minimax risk $R_n(\Theta_n)$ by $R_n$. Theorem 4.1 shows that for sufficiently large $n$, $\sup_{\Theta_n}\rho(\hat\mu_r, \mu) \le c_0 R_n$ so that the bound established by Lemma 8.5 yields

$$\sup_{\mu \in \Theta_n} |\rho(\hat\mu_D, \mu) - \rho(\hat\mu_r, \mu)| \le 2\beta_n t_1^r + 2c_0 I\{r > 1\}R_n^{1-1/r}(\beta_n t_1^r)^{1/r} + 8n t_1^r\sup_{\Theta_n} P_\mu(B_n^c)^{1/2}. \tag{11.20}$$

The thresholds $\hat t_D$ and $\hat t_r$ corresponding to $\hat\mu_D$ and $\hat\mu_r$ both lie in $[\hat t_F, \hat t_G]$, and so with probability one,

$$N' = \#\{i : |y_i| \in [\hat t_D, \hat t_r]\} \le \#\{i : \hat t_F \le |y_i| < \hat t_G\} \le \hat k_F - \hat k_G.$$

[The first inequality is valid except possibly on a zero probability event in which some $|y_i| = \hat t_D$. To see the second inequality, note from (7.1) that $l > \hat k_F$ implies $Y_l < t_l \le \hat t_F$, while (7.2) entails that $l \le \hat k_G$ implies $Y_l \ge t_l \ge \hat t_G$. Consequently $\hat t_F \le Y_l < \hat t_G$ implies $\hat k_G < l \le \hat k_F$ which yields the required inequality.]

If we take $\beta_n = 3\alpha_n k_n$, then Proposition 5.1 implies that

$$P_\mu(B_n^c) = P_\mu(N' > \beta_n) \le P_\mu(\hat k_F - \hat k_G > 3\alpha_n k_n) \le c_0\exp\{-c_1\log^2 n\},$$

so that the third term in (11.20) is $o(R_n)$. Finally $\beta_n t_1^r \le c\alpha_n k_n\tau_\eta^r \le c\alpha_n R_n$ so that the first two terms are also $o(R_n)$, completing the proof.
11.5 Remaining proofs for the $\ell_r$ case

Proof of Lemma 11.1. The function $a \to |z + a|^r$ is Hölder($r$) with constant $C = 2$ uniformly in $z \in \mathbb{R}$ and $r \in (0, 2]$. Consequently, so is $f(a) = E|Z + a|^r$. Since $f'(0) = 0$,

$$\psi_r(a) = \begin{cases} f(a) - f(0) - |a|^r, & r \le 1, \\ \int_0^a [f'(s) - f'(0)]\,ds - |a|^r, & 1 < r \le 2, \end{cases}$$

and the global bound in (11.13) follows from the Hölder properties of $f$.

For a large, + + |2;|''](/)(z)(i2; = o(l) and so it suffices to consider 

/oo /■a (^\~\ poo 

\{a + zY - a'\(l){z)dz <ar ^<j){z)dz + / (2z)Xz)dz < Ca'-^, 
-a J —a ^ Ja 

using (IHini). Thus, for a large, \ipr{a)\ < Ca"-^ + E\Z\' < C2a('^-^)+. □ 



Proof of Lemma 11.2. Making the threshold zone explicit leads to

$$\xi_r(t, \mu) = \int_{|y| \le t} [\,|y - \mu|^r - |y|^r + |\mu|^r\,]\,\phi(y - \mu)\,dy + 2\int_{|y| > t} |y - \mu|^r\phi(y - \mu)\,dy. \tag{11.21}$$

From this, evenness and (11.14) are straightforward. Evenness means that we may take $\mu \ge 0$. Let $c_r = \int |y - \mu|^r\phi(y - \mu)\,dy$, and note that for $y \ge 0$, $\phi(-y - \mu) \le \phi(y - \mu)$. Thus

$$|\xi_r(t, \mu)| \le 2c_r + 2\int_0^t |\mu^r - y^r|\,\phi(y - \mu)\,dy. \tag{11.22}$$

If r < 1, then l/i** — y'^\ < \y — fi\ + las follows by checking cases with y, fj, G [0, 1] and [1, oo] 
respectively. Hence |Cr(i;Ai)| < 2cr + 3 and (|11.14() follows. 

Now suppose 1 < r < 2, and that < /i < t. Break the integral in H11.22|) into two 
pieces. Consider first < y < fi: 

n^i' - f) (t>{y - f,)dy = r r dv v'-' f <P{y - fi)dy (11.23) 
Jo Jo Jo 



Jo 



<r {fi — wY ^{w)dw {w = fi — v) 



/•oo 

Jo 

Arguing similarly on the interval fJ, < y < t, 



{y'' - fJ''')Hy - l^)dy = r j dw' / (f){y - n)dy 



<r I + wY'^^{w)dw < cf~^ 







Combining the last two displays with H11.22() yields H11.14() . 
Finally suppose that fi > t. In (|11.22j) we bound 



(/i^ - y^) cPiy - f,)dy < (f,^ - f)^t - ^) + / (f - f)c^{y - t)dy. 
The last integral is bounded as for the integral over $0 \le y \le \mu$ at (11.23) above (setting $\mu = t$).
Finally, write $\mu = t + x$: since



r //+^' v'^^^dv < C(2t)(''^^)+a; if t > x 



(t + xY -f < 
and X is bounded it follows that for some 



Proof of 1)11. 19() . First, recall from (|11.7p that |(5oi| < /Uj A ii. Lemma 111.11 implies the 
existence of a constant C = C(r) such that ?/'r(Aoi) < C'|/ii|^' A In view of (|11.15)) . 

it remains to show that 

\Ut,^^)-UtM<cw\'- (11-24) 

Rearranging (|11.21() leads to 

ir{t, /x) = - / lyIXy - ^l)dy + M'^t - fi) - <^{-t - ^)] + / 

J|j/|<t J|2+/^|>t 

Write — '?r(i,0) = Ui{fi) + U2{lj) + U^{^) corresponding to the three terms above. 

Consider first Ui. If |^| > 1, then < ^^ + ^4^/11 + z/^Y(f){z)dz < c|/i|^. For 

1^1 < 1, Uiifi) is with C/i(0) = U[{0) = and U" uniformly bounded in t. Hence 
\Ui{fi)\ < Cii'^ < Cl/^r for \n\ < 1 also. Clearly \U2{fi)\ < ^21/^1"', while U^ifi) is bounded 
and with Ul^ifi) = |t - ^|''(/>(t - ^) - |t + /i|''(/>(t + /i). Thus C/;^(0) = and for |^| < 1 and 
t > 2, C/;^' < C3. Hence C/3(^) < C4(/U^ A 1) < Csl^r, and (|TT311) is proved. □ 

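12 Gaussian tails and quantiles

We collect in this Appendix some results about the normal density $\phi$, the normal CDF $\Phi$ (with upper tail $\tilde\Phi = 1 - \Phi$) and the normal quantile function $z(\cdot)$; these have been used extensively above.

Lemma 12.1. We have

$$\phi(v) \le (v + 2v^{-1})\,\tilde\Phi(v), \qquad v \ge \sqrt{2}. \tag{12.1}$$

More generally, Mills' ratio $M(y) = y\tilde\Phi(y)/\phi(y)$ increases from 0 to 1 as $y$ increases from 0 to $\infty$. In particular,

$$\phi(v)/(2v) \le \tilde\Phi(v) \le \phi(v)/v, \qquad v \ge 1, \tag{12.2}$$
$$2\tilde\Phi(v) \le e^{-v^2/2}, \qquad v \ge 0, \tag{12.3}$$
$$\tilde\Phi(v - c/v) \le 4e^{c}\,\tilde\Phi(v), \qquad v \ge \sqrt{2c}. \tag{12.4}$$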

Proof. Bound (12.1) follows from the standard lower bound $\tilde\Phi(v) \ge (v^{-1} - v^{-3})\phi(v)$, compare Feller (1968, p. 175), and $(1 - u^{-2})^{-1} \le 1 + 2u^{-2}$ for $u \ge \sqrt{2}$. Monotonicity of Mills' ratio follows from differentiation and partial integration, and then evaluating $M(1) = \tilde\Phi(1)/\phi(1) = 0.655 \ge 1/2$ yields (12.2). The difference $g(v) = 2\tilde\Phi(v) - e^{-v^2/2}$ has $g'(v) = -[2\phi(0) - v]e^{-v^2/2} \le 0$ for $v \le 2\phi(0)$ and vanishes at 0. But when $v > 2\phi(0)$, (12.3) follows from (12.2). Finally, (12.4) follows by applying the right and then left sides of (12.2), noting that $t - c/t \ge t/2$ and that $\phi(t - c/t) \le \phi(t)e^{c}$. □
Lemma 12.2. Suppose that $k$ and $\alpha$ are such that $\max\{\alpha, 1/\alpha\} \le C\log k$. Then

$$\sqrt{2\log k\alpha} = \sqrt{2\log k} + \theta\sqrt{C}, \tag{12.5}$$

and if $\alpha \ge 1$, then $0 \le \theta \le \sqrt{2}/e < 1$, while if $\alpha < 1$, then $-1.1 \le -\sqrt{8}/e \le \theta \le 0$.

Proof. If $\alpha \ge 1$, then from the inequality $(1 + x)^{1/2} \le 1 + x/2$, valid for nonnegative $x$,

$$\sqrt{2\log k\alpha} - \sqrt{2\log k} \le \frac{\log\alpha}{\sqrt{2\log k}} \le \sqrt{C}\,\frac{\log(C\log k)}{\sqrt{2C\log k}} \le \frac{\sqrt{2}}{e}\sqrt{C},$$

since $u \to u^{-1/2}\log u \le 2/e$ for $u \ge 1$. When $\alpha < 1$, essentially the same argument applies, with inequalities reversed, and using $(1 + x)^{1/2} \ge 1 + x$, valid for $-1 \le x \le 0$. □

As an immediate corollary, we note that if $k_n = b_n n^{1-\beta}$ and $\max\{b_n, 1/b_n\} \le c\log^\rho n$ for $0 \le \rho < 1$, then

$$|\sqrt{2\log(n/k_n)} - \sqrt{2\beta\log n}| \le 1.1\,(c^{-1}\beta\log^{1-\rho} n)^{-1/2}. \tag{12.6}$$

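Lemma 12.3. (1) Let $z(\eta) = \tilde\Phi^{-1}(\eta)$ denote the upper $(1 - \eta)$th percentile of the Gaussian distribution. If $\eta \le .01$, then

$$z^2(\eta) = 2\log\eta^{-1} - \log\log\eta^{-1} - r_2(\eta), \qquad r_2(\eta) \in [1.8, 3], \tag{12.7}$$
$$z(\eta) = \sqrt{2\log\eta^{-1}} - r_1(\eta), \qquad r_1(\eta) \in [0, 1.5]. \tag{12.8}$$

(2) We have $z'(\eta) = -1/\phi(z(\eta))$, and hence if $\eta_2 > \eta_1 > 0$, then

$$z(\eta_1) - z(\eta_2) \le \frac{\eta_2 - \eta_1}{\eta_1 z(\eta_1)}. \tag{12.9}$$

In addition, if $t_\nu = z(\nu q/2n) \ge 1$, then

$$-\partial t_\nu/\partial\nu = \theta/(\nu t_\nu), \qquad \theta \in [\tfrac{1}{2}, 1], \tag{12.10}$$

and for $0 < r \le 2$ and $t_\nu > 2$,

$$\partial(\nu t_\nu^r)/\partial\nu = t_\nu^r[1 - r\theta t_\nu^{-2}] > 0. \tag{12.11}$$

(3) If $n^{-1}\log^5 n \le \eta_n^p \le b_2 n^{-b_3}$, then

$$2b_3\log n - 2\log b_2 \le \tau_\eta^2 \le 2\log n - 10\log\log n, \tag{12.12}$$

and so for $n \ge n(b)$, we have

$$\tau_\eta^2 = 2\gamma_n\log n, \qquad 0 < c(b) \le \gamma_n \le 1. \tag{12.13}$$

(4) If $q_n \ge b_1/\log n$ and $\nu \le b_2 n^{1-b_3}$, then for $n \ge n(b_2, b_3)$,

$$-3/2 \le t_\nu - \sqrt{2\log(n/\nu)} \le 2(b_1 b_3)^{-1/2}. \tag{12.14}$$

(5) (i) For $n \ge n(b)$,

$$t_1/\tau_\eta \le c(b). \tag{12.15}$$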
(ii) If a < 1 and rjn < e then

t[aK] > - 3/2. (12.16) 

If a < 6^^, the same inequality holds for rj < r]{p,6) sufficiently small. 

(Hi) If a > 7T~^, then for rjn < 77(7, p, 6i, ^3) sufficiently small (and not the same at 
each appearance), 

t[akn] < + c{hi,h'i), and (12.17) 
t\akn] < 2r^. (12.18) 

(iv) In particular, for a € Yit^^ ,5^^\, then as r]n 0, 

t[akrr] ~ Trj. (12.19) 

Proof. (1) Taking logarithms in the inequality

$$\phi(z)(1 - z^{-2})/z \le \tilde\Phi(z) \le \phi(z)/z$$

and using $\tilde\Phi(z(\eta)) = \eta$ leads to

$$2\log(1 - z(\eta)^{-2}) \le z^2(\eta) - 2\log\eta^{-1} + 2\log z(\eta) + \log 2\pi \le 0. \qquad (12.20)$$
Now $z^2(\eta) \le 2\log\eta^{-1}$ for $\eta \le 1/2$ and $\eta \to \log(1 - z(\eta)^{-2})$ is decreasing, so for $\eta \le \eta_0$,

$$z^2(\eta) \ge 2\log\eta^{-1} - \log\log\eta^{-1} + [2\log(1 - z(\eta_0)^{-2}) - \log 4\pi].$$

For $\eta_0 = 0.01$, the quantity in square brackets equals $-2.94$. Substituting this into the upper
bound half of (12.20) yields

$$z^2(\eta) \le 2\log\eta^{-1} - \log\log\eta^{-1} - g(\log\eta^{-1}),$$

where the increasing function $g(v) = \log[2\pi v^{-1}(2v - \log v - 3)] \ge 1.8$ for $v \ge \log\eta_0^{-1} = 4.60$.

Turning to the proof of (12.8), note first that $\eta \le 0.01$ entails $\log\log\eta^{-1} \ge 1.527$, which
implies the right inequality in view of the bound on $r_2(\eta)$ in (12.7). For the left inequality,
we rewrite (12.7) and appeal to (12.5), by setting $k = \eta^{-1}$ and $\alpha^{-1} = \sqrt{\log\eta^{-1}}\,e^{r_2(\eta)/2}$.
The error bound in (12.5) is $(\sqrt{8}/e)\sqrt{C}$, where

$$C = \frac{1}{\alpha\log k} = \frac{e^{r_2(\eta)/2}}{\sqrt{\log\eta^{-1}}}$$

whenever $\eta \le \eta_0$. With $\eta_0 = 0.01$, calculation shows that $(\sqrt{8}/e)\sqrt{C} \le 1.4850 < 3/2$.
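
The remainder bounds in (12.7) and (12.8) can be spot-checked numerically. The sketch below uses the standard-library `statistics.NormalDist` to evaluate the upper quantile $z(\eta)$ and then solves (12.7) and (12.8) for $r_2(\eta)$ and $r_1(\eta)$. The helper names and the grid of $\eta$ values are assumptions chosen for illustration; a check at finitely many points is of course not a proof.

```python
import math
from statistics import NormalDist

def upper_quantile(eta):
    """z(eta) with P(Z > z(eta)) = eta for a standard normal Z."""
    # By symmetry the upper eta-quantile is minus the lower eta-quantile.
    return -NormalDist().inv_cdf(eta)

def remainders(eta):
    """Return (r1, r2) defined by solving (12.8) and (12.7) for the remainders."""
    L = math.log(1.0 / eta)
    z = upper_quantile(eta)
    r2 = 2 * L - math.log(L) - z * z
    r1 = math.sqrt(2 * L) - z
    return r1, r2

for exponent in range(2, 13):
    eta = 10.0 ** (-exponent)
    r1, r2 = remainders(eta)
    print(f"eta=1e-{exponent:02d}  r1={r1:.4f}  r2={r2:.4f}")
```

On this grid $r_1$ decreases slowly from about 0.71 and $r_2$ stays between roughly 2.27 and 2.46, inside the stated intervals.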

(2) Differentiating the equation $\eta = \tilde\Phi(z(\eta))$ yields $z'(\eta) = -1/\phi(z(\eta))$, which is decreasing
in magnitude as $\eta$ increases. Hence

$$z(\eta_1) - z(\eta_2) \le (\eta_2 - \eta_1)/\phi(z(\eta_1)).$$

Since $\eta = \tilde\Phi(z(\eta)) \le \phi(z(\eta))/z(\eta)$, (12.9) follows.

From the Mills' ratio remark in the proof of Lemma 12.1 we have $z\tilde\Phi(z)/\phi(z) \ge 1/2$
for $z \ge 1$, and so for such $z(\eta)$,

$$-z'(\eta) = \theta/(\eta z(\eta)), \qquad \theta \in [1/2, 1].$$
Since $\partial t_\nu/\partial\nu = (q_n/2n)z'(\nu q/2n)$, this yields (12.10), from which (12.11) is immediate.
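
The two facts used in part (2) can be illustrated with a short Python sketch: the identity $z'(\eta) = -1/\phi(z(\eta))$ is compared against a central finite difference, and the factor $\theta = \eta z(\eta)/\phi(z(\eta))$ is evaluated at a few levels with $z(\eta) \ge 1$. The function names, the step size and the chosen $\eta$ values are assumptions made for this illustration only.

```python
from statistics import NormalDist

N = NormalDist()

def upper_quantile(eta):
    """z(eta) with P(Z > z(eta)) = eta for a standard normal Z."""
    return -N.inv_cdf(eta)

def theta(eta):
    """theta = eta * z / phi(z), so that -z'(eta) = theta / (eta * z(eta))."""
    z = upper_quantile(eta)
    return eta * z / N.pdf(z)

def derivative_fd(eta, rel_step=1e-4):
    """Central finite-difference approximation to z'(eta)."""
    h = rel_step * eta
    return (upper_quantile(eta + h) - upper_quantile(eta - h)) / (2 * h)

for eta in (0.15, 0.01, 1e-4, 1e-8):
    exact = -1.0 / N.pdf(upper_quantile(eta))
    print(f"eta={eta:g}  theta={theta(eta):.4f}  "
          f"z'={exact:.6g}  finite diff={derivative_fd(eta):.6g}")
```

All four levels have $z(\eta) \ge 1$ (that is, $\eta \le \tilde\Phi(1) \approx 0.159$), and $\theta$ increases toward 1 as $\eta$ decreases.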

(3,4) Displays (12.12) and (12.13) are immediate. Bound (12.8) says that $t_\nu - \sqrt{2\log(2n/q\nu)} \in [-3/2, 0]$.
Apply inequality (12.5) with $\alpha = 2/q > 1$ and $k = n/\nu$ to find that

$$0 \le \sqrt{2\log(2n/q\nu)} - \sqrt{2\log(n/\nu)} \le c,$$

where for $n \ge n(b_2, b_3)$,

$$c^2 = \frac{2}{q_n\log(n/\nu)} \le \frac{2\log n}{b_1(b_3\log n - \log b_2)} \le \frac{4}{b_1 b_3}.$$
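
To see the size of the gap controlled by (12.14) in a concrete case, the sketch below evaluates $t_\nu - \sqrt{2\log(n/\nu)}$ for one hypothetical configuration. The values $n = 10^6$, $q = 0.1$, $b_2 = 1$, $b_3 = 2/3$ and the list of $\nu$ are assumptions chosen for illustration (so that $b_1 = q\log n$ and $\nu \le 100 \approx b_2 n^{1-b_3}$); the lemma itself is an asymptotic statement for $n \ge n(b_2, b_3)$.

```python
import math
from statistics import NormalDist

def t_nu(nu, n, q):
    """Threshold t_nu = z(nu * q / (2n)), with z the upper Gaussian quantile."""
    return -NormalDist().inv_cdf(nu * q / (2 * n))

# Hypothetical configuration, not taken from the paper.
n, q = 10 ** 6, 0.1
b1 = q * math.log(n)      # so that q = b1 / log n
b3 = 2.0 / 3.0            # nu <= n^(1 - b3), about 100, with b2 = 1
upper = 2.0 / math.sqrt(b1 * b3)

gaps = {}
for nu in (1, 10, 50):
    gaps[nu] = t_nu(nu, n, q) - math.sqrt(2 * math.log(n / nu))
    print(f"nu={nu:3d}  gap={gaps[nu]:+.4f}  allowed range=[-1.5, {upper:.3f}]")
```

In this configuration the gap is small and positive, well inside the interval allowed by (12.14).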

(5) Part (i) follows from (12.14) and (12.13). For part (ii) we first note, from (12.14),
that for $n \ge n(b_2, b_3)$,

$$t[ak_n] = \sqrt{2\log n/(ak_n)} + \theta_1, \qquad -3/2 \le \theta_1 \le 2(b_1 b_3)^{-1/2}.$$

Since $n/k_n = \eta_n^{-p}\tau_\eta^{p}$, we can write

$$2\log[n/(ak_n)] = \tau_\eta^2 + L(\eta, a), \qquad L(\eta, a) = p\log\tau_\eta^2 + 2\log a^{-1}.$$

If $\eta^p \le e^{-1/2}$ then $\tau_\eta^2 = 2\log\eta^{-p} \ge 1$, and if also $a \le 1$, then $L(\eta, a) \ge 0$, so that (12.16)
follows. If however we assume only that $a \le (1 - q')^{-1}$, then $L(\eta, a) \ge 0$ if $\tau_\eta$ is sufficiently
large, i.e. if $\eta^p \le \eta(p, q')$ is sufficiently small.

For part (iii), from the bound $\sqrt{x + \epsilon} \le \sqrt{x} + \epsilon/(2\sqrt{x})$ (valid for $0 \le \epsilon \le 3x$), we find

$$\sqrt{\tau_\eta^2 + L(\eta, a)} \le \tau_\eta + L(\eta, a)/(2\tau_\eta),$$

and clearly $L(\eta, a)/\tau_\eta \le [2(p + 1)\log\tau_\eta + 2\log\gamma^{-1}]/\tau_\eta \le 1$ for $\eta \le \eta(\gamma, p)$. This establishes
(12.17), and (12.18) is a direct consequence. □

13 Proofs of Lower Bounds 

This final appendix combines ideas from Sections 6-8 to finish the Lower Bound result. 
13.1 Proof of Proposition 5.2

From the structure (5.12) of the configuration the total risk can be written in terms of
the univariate component risks as

$$\rho(\hat\mu_{H,t}, \mu_\alpha) = k_n\rho_H(t, m_\alpha) + (n - k_n)\rho_H(t, 0).$$

From (8.6) together with the definition of $t = t[ak_n]$, we obtain

$$n\rho_H(t[ak_n], 0) = ak_n q_n t^r[ak_n](1 + \theta),$$

with $0 \le \theta \le 2t[ak_n]^{-2}$. Since $t[ak_n] \sim \tau_\eta$ by (12.19), we conclude that

$$(n - k_n)\rho_H(t[ak_n], 0) \sim aq_n k_n\tau_\eta^r.$$

In the notation of (8.2),

$$\rho_H(t, m_\alpha) = m_\alpha^r[\Phi(t - m_\alpha) - \Phi(-t - m_\alpha)] + E(m_\alpha, t),$$

and as noted there, $0 \le E(m_\alpha, t) \le c_r$. In addition, $m_\alpha^r\Phi(-t - m_\alpha) \le c_r$, and as noted
at (5.16), $m_\alpha - t = \alpha + O(t^{-1})$ and so $\Phi(t - m_\alpha) = \tilde\Phi(\alpha) + O(t^{-1})$. Consequently, since
$m_\alpha \sim \tau_\eta$, we conclude that

$$k_n\rho_H(t, m_\alpha) \sim k_n\tau_\eta^r\tilde\Phi(\alpha)(1 + o(1)).$$
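
The univariate limit used above can be checked in closed form in the special case $r = 2$, where the hard-thresholding risk for $y \sim N(\mu, 1)$ splits exactly into a bias part $\mu^2 P(|y| \le t)$ and a tail part. The Python sketch below is a minimal illustration under two assumptions that go beyond the text: squared-error loss ($r = 2$), and the simplified choice $m_\alpha = t + \alpha$ exactly. The function names are hypothetical.

```python
from statistics import NormalDist

N = NormalDist()

def hard_threshold_risk_r2(t, mu):
    """Exact risk E(eta_H(y, t) - mu)^2 for y ~ N(mu, 1), returned as (bias, tail)."""
    # Bias part: the estimate is 0 when |y| <= t, contributing mu^2 * P(|y| <= t).
    bias = mu ** 2 * (N.cdf(t - mu) - N.cdf(-t - mu))
    # Tail part E[(y - mu)^2; |y| > t], using
    # int_a^inf z^2 phi(z) dz = a * phi(a) + P(Z > a), valid for any real a.
    a, b = t - mu, t + mu
    tail = a * N.pdf(a) + N.cdf(-a) + b * N.pdf(b) + N.cdf(-b)
    return bias, tail

def normalized_risk(t, alpha):
    """rho_H(t, t + alpha) / t^2, to be compared with the upper tail at alpha."""
    bias, tail = hard_threshold_risk_r2(t, t + alpha)
    return (bias + tail) / t ** 2

for alpha in (-1.0, 0.0, 0.5):
    print(alpha, normalized_risk(1000.0, alpha), N.cdf(-alpha))
```

For a large threshold the normalized risk is close to $\tilde\Phi(\alpha)$, while the tail part stays bounded, in line with $0 \le E(m_\alpha, t) \le c_r$.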
13.2 Proof of Proposition 5.3

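We use (8.2) to decompose the total risk

$$\rho(\hat\mu_{H,t}, \mu_\alpha) = \sum_l D(\bar\mu_{\alpha l}, t) + E(\bar\mu_{\alpha l}, t) = D + E,$$

say. To bound the "bias" or "false negative" term $D$, we choose an index $l_\alpha = n\eta_n^p m_\alpha^{-p}$ so
that $\bar\mu_{l_\alpha} = m_\alpha \sim \tau_\eta$. Now decompose $D$ into $D_1 + D_2$ according as $l \le l_\alpha$ or $l > l_\alpha$. For
$l \le l_\alpha$, we have identically $\bar\mu_{\alpha l} = m_\alpha$, and so

$$D_1 = l_\alpha m_\alpha^r[\Phi(t - m_\alpha) - \Phi(-t - m_\alpha)] = l_\alpha m_\alpha^r\tilde\Phi(\alpha)(1 + o(1))$$

by the same arguments as for Proposition 5.2. And since $m_\alpha \sim \tau_\eta$, we have
$l_\alpha m_\alpha^r \sim k_n\tau_\eta^r$.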

The novelty with the weak-$\ell_p$ risk comes in the analysis of

$$D_2 = \sum_{l > l_\alpha}\bar\mu_l^r[\Phi(t - \bar\mu_l) - \Phi(-t - \bar\mu_l)].$$

The second term is negligible, being bounded in absolute value by $\tilde\Phi(t)\sum_{l > l_\alpha}\bar\mu_l^r = o(k_n\tau_\eta^r)$.
For the first term we have an integral approximation (with $\bar\mu(x) = \eta_n(n/x)^{1/p}$)

$$D_2 \sim \int_{l_\alpha}^{n}\bar\mu^r(x)\Phi(t - \bar\mu(x))\,dx = p\,l_\alpha m_\alpha^r\int_{\eta_n/m_\alpha}^{1} v^{r-p-1}\Phi(t - m_\alpha v)\,dv,$$

after setting $x = l_\alpha v^{-p}$. [Remark: to bound the error in the integral approximation, observe
that if $f'(x)$ is smooth with at most one zero in $[a, b]$, then the difference between $\sum_a^b f(l)$
and $\int_a^b f$ is bounded by $\sup_{[a,b]}|f|$.]

For $0 < v < 1$, we have $t - m_\alpha v \sim (1 - v)\tau_\eta \to \infty$, and so from the dominated
convergence theorem, the integral converges to $\int_0^1 v^{r-p-1}\,dv$, so

$$D_2 \sim [p/(r - p)]\,l_\alpha m_\alpha^r \sim [p/(r - p)]\,k_n\tau_\eta^r.$$

Putting together the analyses of $D_1$ and $D_2$, we find that the false-negative term

$$D = [\tilde\Phi(\alpha) + p/(r - p)]\,k_n\tau_\eta^r(1 + o(1)).$$
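
The constant $p/(r - p)$ in the false-negative term comes from the limit of the integral $p\int v^{r-p-1}\Phi(t - m_\alpha v)\,dv$. The Python sketch below approximates that integral by a midpoint rule for a large threshold and compares it with $p/(r - p)$. It is a rough illustration under assumptions not in the text: the lower limit $\eta_n/m_\alpha$ is replaced by 0, $m_\alpha$ is set equal to $t$, and the function name, step count and parameter values are hypothetical.

```python
from statistics import NormalDist

def d2_constant(t, m, r, p, lower=0.0, steps=20000):
    """Midpoint-rule value of p * int_lower^1 v^(r-p-1) * Phi(t - m*v) dv."""
    cdf = NormalDist().cdf
    h = (1.0 - lower) / steps
    total = 0.0
    for i in range(steps):
        v = lower + (i + 0.5) * h          # midpoint of the i-th subinterval
        total += v ** (r - p - 1) * cdf(t - m * v)
    return p * h * total

# Illustrative parameter choices with r - p - 1 >= 0, so the integrand is bounded.
for r, p in ((2.0, 1.0), (2.0, 0.5)):
    approx = d2_constant(400.0, 400.0, r, p)
    print(f"r={r} p={p}  integral={approx:.4f}  p/(r-p)={p / (r - p):.4f}")
```

When $r - p - 1 < 0$ the integrand is singular (though integrable) at 0 and a plain midpoint rule converges slowly, which is why the example sticks to bounded integrands.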

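To bound the "variance" or "false positive" term $E$, decompose the sum into three
terms, corresponding to indices in the ranges $[1, l_1]$, $(l_1, l_2]$ and $(l_2, n]$, where

$$l_1 = n\eta^p \Leftrightarrow \bar\mu_{l_1} = 1, \quad\text{and}\quad l_2 = n\eta^p\tau_\eta^{2p} \Leftrightarrow \bar\mu_{l_2} = \tau_\eta^{-2}.$$

We use (i) - (iii) of Lemma 8.1 to show that terms $E_1$ and $E_2$ are negligible. For $E_1$, the
global bound (i) gives $E_1 \le c_r l_1 = o(k_n\tau_\eta^r)$. For $E_2$, properties (ii) (monotonicity) and (iii)
show that

$$E(\bar\mu_l, t) \le E(\bar\mu_{l_1}, t) = E(1, t) \le c_0 t^r\phi(t - 1) \le c_1\tau_\eta^r\phi(\tau_\eta - 5/2),$$
where the last inequality uses $t = t[ak_n] \sim \tau_\eta$ and that $t \ge \tau_\eta - 3/2$ from (12.16). Hence

$$E_2 \le l_2 E(1, t) \le c_0 k_n\tau_\eta^r\cdot\tau_\eta^{3p}\phi(\tau_\eta - 5/2) = o(k_n\tau_\eta^r).$$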
Finally, we focus on the dominant term $E_3 = \sum_{l > l_2} E(\bar\mu_l, t)$. For $l > l_2$, we have
$\bar\mu_l \le \tau_\eta^{-2}$ and $t \le c\tau_\eta$ so that

$$\phi(t \pm \bar\mu_l) = \phi(t)\exp\{\mp t\bar\mu_l - \bar\mu_l^2/2\} = \phi(t)(1 + O(\tau_\eta^{-1})).$$

Now if $0 \le \mu \le \tau_\eta^{-2}$ and $t = t[ak_n] \le 2\tau_\eta$ (from (12.18)), then

$$\phi(t - \mu) = \phi(t)\exp(t\mu - \mu^2/2) = \phi(t)(1 + O(\tau_\eta^{-1})),$$

and so 

Consequently, using we get 

$$E_3 = 2(n - l_2)t^{r-1}\phi(t)(1 + O(\tau_\eta^{-1})) = aq_n k_n\tau_\eta^r(1 + o(1)),$$
using the manipulations of (5.17) - (5.19).

13.3 Proof of Proposition 5.4

Fix $a_1 > 0$. The idea, both for $\ell_0[\eta]$ and for $m_p$, is to obtain bounds

$$M_-(k) \le M(k; \mu_\alpha) \le M_+(k), \qquad k \in [a_1 k_n, a_1^{-1}k_n],$$

for which solutions $k_\pm$ to $M_\pm(k) = k$ can be easily found. From monotonicity of $k \to
M(k; \mu_\alpha)$, it then follows that $k_- \le k(\mu_\alpha) \le k_+$.

In both cases we establish a representation of the form

$$M(ak_n; \mu_\alpha) = k_n[\Phi(\alpha) + \theta_1\tau_\eta^{-1}] + [1 + \theta_2\delta(\eta_n)]q_n k, \qquad (13.1)$$

where $|\theta_i| \le c(\alpha, a, b)$ and $\delta(\eta_n) \to 0$ as $\eta_n \to 0$. From this, expressions for $k_\pm$ are easily
found and $k_\pm \sim k_n\Phi(\alpha)/(1 - q_n)$ is easily checked.

For io[rin], (|TTrT|) follows from (j^^ and For mp[r] n] , we first formulate a lemma. 

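Lemma 13.1. Let $\alpha$ and $a$ be fixed. If $|f|$ and $|f'|$ are bounded by 1, then

$$k_n^{-1}\sum_{l=1}^{k_n} f(\bar\mu_{\alpha l} - t[ak_n]) = f(\alpha) + \theta\tau_\eta^{-1}, \qquad |\theta| \le c(\alpha, a, b).$$

Proof. For $l \le l_\alpha = n\eta_n^p m_\alpha^{-p}$, we have $\bar\mu_{\alpha l} = m_\alpha$ and so with $t = t[ak_n]$,

$$\mathrm{LHS} = k_n^{-1}\sum_{l=1}^{l_\alpha\wedge k_n} f(m_\alpha - t) + \theta_1\|f\|_\infty[1 - 1\wedge(l_\alpha/k_n)], \qquad |\theta_1| \le 1.$$

From (5.13), we have

$$f(m_\alpha - t) = f(\alpha) + \theta_2\|f'\|_\infty(t[k_n] - t[ak_n]) = f(\alpha) + \theta_3\tau_\eta^{-1}$$
with $|\theta_2| \le 1$ and $|\theta_3| \le c(a)$. From (12.16) and (12.17) we have $|t[k_n] - \tau_\eta| \le c(b)$ and so

$$k_n/l_\alpha = (m_\alpha/\tau_\eta)^p = 1 + \theta_5\tau_\eta^{-1}, \qquad |\theta_5| \le c(b).$$
Now combine the last three displays. □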
We exploit the division into the 'positive', 'transition' and 'negative' zones defined in
Section 6.4. Applying the preceding Lemma with $f = \Phi$ yields

$$M_{pos}(k; \mu_\alpha) = k_n[\Phi(\alpha) + \theta\tau_\eta^{-1}],$$

while from (6.27) and (6.29) we obtain

$$M_{trn}(k; \mu_\alpha) = O(k_n\tau_\eta^{-1}), \quad\text{and}\quad M_{neg}(k; \mu_\alpha) = [1 + \theta_5\delta_p(\epsilon_n)]q_n k.$$
Putting together the last two displays, we recover (13.1) and the result.

13.4 Proof of Proposition 5.5

Let $t'$ denote the fixed threshold $t' = t[a_0 k_n]$ where $a_0 = (1 - q)^{-1}\Phi(\alpha)$. We will use Lemma
8.5 to show that

$$|\rho(\hat\mu_F, \mu_\alpha) - \rho(\hat\mu_{H,t'}, \mu_\alpha)| = o(k_n\tau_\eta^r),$$

so that the conclusion will follow from Proposition 5.3.

To apply Lemma 8.5, set $\hat\mu = \hat\mu_F$ and $\hat\mu' = \hat\mu_{H,t'}$, so that the thresholds are $\hat t = \hat t_F$ and
$\hat t' = t'$ respectively, which are both bounded by $t_1$.

Let $k_\pm = k_\pm(\mu_\alpha) = k(\mu_\alpha) \pm \alpha_n k_n$ and recall that the event $A_n = \{k_- \le \hat k_F \le k_+\}$ is
$m_p[\eta_n]$-likely. If we set $t_\pm = t[k_\mp]$, then on event $A_n$,

$$N' = \#\{i : |y_i| \in [\hat t_F, t']\} \le N'' = \#\{i : |y_i| \in [t_-, t_+]\}.$$

Hence $P\{N' > \beta_n\} \le P(A_n^c) + P\{N'' > \beta_n\}$.

We use the exponential bound of Lemma 7.1 to choose $\beta_n$ so that $P\{N'' > \beta_n\}$ is small.
From the definition of the threshold function, $M(k) = M(k; \mu_\alpha)$, we have

$$M'' := EN'' = M(k_+) - M(k_-).$$

Since $k \to \dot M(k)$ is decreasing and using the derivative bounds of Proposition 6.6, we find
that for n sufficiently large,

$$(k_+ - k_-)\dot M(k_+) \le M'' \le (k_+ - k_-)\dot M(k_-). \qquad (13.2)$$

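Claim. $\dot M(k_\pm; \mu_\alpha) = \theta[q_n + c(\alpha)\tau_\eta^{-1}]$ for some $\theta \in [1/2, 2]$.

Proof. We again use the positive-transition-negative decomposition, this time of $\dot M(ak_n; \mu_\alpha)$.
Write, with $\nu = ak_n$,

$$\dot M_{pos}(ak_n; \mu_\alpha) = (-\partial t_\nu/\partial\nu)\sum_{l=1}^{k_n}\phi(t - \bar\mu_{\alpha l}) + \phi(t + \bar\mu_{\alpha l}).$$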

From (12.19) and (12.10), we have $-k_n(\partial t_\nu/\partial\nu) \sim 1/(a\tau_\eta)$. Applying Lemma 13.1 to $f(x) =
\phi(-x)$, we conclude that for n sufficiently large,

$$\dot M_{pos}(ak_n; \mu_\alpha) = (\theta_1/(a\tau_\eta))[\phi(\alpha) + \theta_2\tau_\eta^{-1}].$$

Appealing to (HHHl) and 

$$\dot M_{trn}(ak_n; \mu_\alpha) = \theta_3\tau_\eta^{-1}, \quad\text{and}\quad \dot M_{neg}(ak_n; \mu_\alpha) = [1 + \theta_4\delta_p(\epsilon_n)]q_n.$$

Combining the last two displays and noting that $k_\pm = k(\mu_\alpha) \pm \alpha_n k_n$ correspond to $a =
\Phi(\alpha)(1 - q_n)^{-1}$, we obtain the claim. □
Now set $q_{\alpha n} = q_n + c(\alpha)\tau_\eta^{-1}$ and select $\beta_n = 8\alpha_n k_n q_{\alpha n}$. From (13.2) and the claim, we
have

$$\alpha_n k_n q_{\alpha n} \le M'' \le 4\alpha_n k_n q_{\alpha n},$$

and so $\beta_n/M'' \ge 2$. Consequently,

$$P\{N'' > \beta_n\} \le \exp\{-(1/4)M''h(\beta_n/M'')\}
\le \exp\{-(1/4)\alpha_n k_n q_{\alpha n}h(2)\} \le c_0\exp\{-c_1\log^2 n\}.$$

Now $\beta_n t_1^r \asymp \alpha_n q_{\alpha n}k_n\tau_\eta^r = o(k_n\tau_\eta^r)$, while Proposition 5.3 shows that $\rho(\hat\mu_{H,t'}, \mu_\alpha) =
O(k_n\tau_\eta^r)$. So Lemma 8.5 applies and we are done.